Discussion: Shard Volume testing (3.7.5)
Lindsay Mathieson
2015-10-25 06:29:17 UTC
Krutika has been working on several performance improvements for sharding
and the results have been encouraging for virtual machine workloads.
Testing feedback would be very welcome!
I've managed to set up a replica 3 3.7.5 shard test volume, hosted on
virtualised Debian 8.2 servers, so performance is a bit crap :)

3 Nodes, gn1, hn2 & gn3
Each node has:
- 1GB RAM
- 1Gb Ethernet
- 512 GB disk hosted on a ZFS External USB Drive :)

- Datastore is shared out via NFS to the main cluster for running a VM
- I have the datastore mounted using glusterfs inside each test node so I
can examine the data directly.
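For reference, the glusterfs mounts on the test nodes are just plain fuse
mounts, something along these lines (server and mount point here are only
examples of what I use):

  mount -t glusterfs gn1:/datastore /mnt/datastore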



I've got two VMs running off it, one a 65GB (25GB sparse) Windows 7. I've
been running benchmarks and testing node failures by killing the cluster
processes and killing actual nodes.

- Heal speed is immensely faster, a matter of minutes rather than hours.
- Read performance is quite good.
- Write performance is atrocious, but given the limited resources not
unexpected.
- I'll be upgrading my main cluster to Jessie soon and will be able to test
with real hardware and bonded connections, plus using gfapi direct. Then
I'll be able to do real benchmarks.

One Bug:
After heals completed I shut down the VMs and ran an md5sum on the VM image
(via glusterfs) on each node. They all matched except for one time on gn3.
Once I unmounted/remounted the datastore on gn3 the md5sum matched.
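For what it's worth, the check itself was nothing fancy - just the image
file over the glusterfs mount on each node, along these lines (the image
path is only illustrative):

  for n in gn1 hn2 gn3; do ssh $n md5sum /mnt/datastore/images/win7.qcow2; done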

One Oddity:
gluster volume heal datastore info *always* shows a split-brain on the
directory, but it always heals without intervention. Dunno if this is
normal or not.

Questions:
- I'd be interested to know how the shards are organised and accessed - it
looks like thousands of 4MB files in the .shard directory, and I'm concerned
access times will go in the toilet once many large VM images are stored on
the volume.

- Is it worth experimenting with different shard sizes?

- Anything you'd like me to test?

Thanks,
--
Lindsay
Krutika Dhananjay
2015-10-26 04:54:50 UTC
----- Original Message -----
Sent: Sunday, October 25, 2015 11:59:17 AM
Subject: [Gluster-users] Shard Volume testing (3.7.5)
Krutika has been working on several performance improvements for sharding
and
the results have been encouraging for virtual machine workloads.
Testing feedback would be very welcome!
Hi Lindsay,

Thank you for trying out sharding and for your feedback. :) Please find my comments inline.
I've managed to set up a replica 3 3.7.5 shard test volume, hosted on
virtualised Debian 8.2 servers, so performance is a bit crap :)
3 Nodes, gn1, hn2 & gn3
- 1GB RAM
- 1Gb Ethernet
- 512 GB disk hosted on a ZFS External USB Drive :)
- Datastore is shared out via NFS to the main cluster for running a VM
- I have the datastore mounted using glusterfs inside each test node so I can
examine the data directly.
I've got two VMs running off it, one a 65GB (25GB sparse) Windows 7. I've
been running benchmarks and testing node failures by killing the cluster
processes and killing actual nodes.
- Heal speed is immensely faster, a matter of minutes rather than hours.
- Read performance is quite good
Good to hear. :)
- Write performance is atrocious, but given the limited resources not
unexpected.
With a block size as low as 4MB, to the replicate module these individual shards appear as a large number of small(er) files, effectively turning it into some form of a small-file workload.
There is an enhancement being worked on in AFR by Pranith which attempts to improve write performance and will be especially useful when used with sharding. That should make this problem go away.
- I'll be upgrading my main cluster to Jessie soon and will be able to test
with real hardware and bonded connections, plus using gfapi direct. Then
I'll be able to do real benchmarks.
After heals completed I shut down the VMs and ran an md5sum on the VM image
(via glusterfs) on each node. They all matched except for one time on gn3.
Once I unmounted/remounted the datastore on gn3 the md5sum matched.
This could possibly be the effect of a caching bug reported at https://bugzilla.redhat.com/show_bug.cgi?id=1272986 . The fix is out for review and I'm confident that it will make it into 3.7.6.
gluster volume heal datastore info *always* shows a split-brain on the
directory, but it always heals without intervention. Dunno if this is normal
or not.
Which directory would this be? Do you have the glustershd logs?
- I'd be interested to know how the shards are organised and accessed - it
looks like thousands of 4MB files in the .shard directory, and I'm concerned access
times will go in the toilet once many large VM images are stored on the
volume.
Here is some documentation on sharding: https://gluster.readthedocs.org/en/release-3.7.0/Features/shard/ . Let me know if you have more questions, and I will be happy to answer them.
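If it helps to see it on disk: the first block stays in the original file at
its usual path, and the remaining pieces are stored under the hidden /.shard
directory on the bricks, named <gfid-of-the-original-file>.<shard-number>,
with the block size and true file size tracked in xattrs on the original
file. A quick way to poke at it (the brick path below is only an example):

  # shard metadata kept as xattrs on the base file, on any brick:
  getfattr -d -m. -e hex /bricks/brick1/datastore/images/vm1.qcow2
  # look for trusted.glusterfs.shard.block-size and
  # trusted.glusterfs.shard.file-size

  # the remaining shards of all files on this brick:
  ls /bricks/brick1/datastore/.shard | head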
The problems we foresaw with too many 4MB shards are that
i. entry self-heal under /.shard could result in a complete crawl of the /.shard directory during heal, or
ii. a disk replacement could involve a large number of files needing to be created and healed to the sink brick,
both of which would result in slower "entry" heal and rather high resource consumption from the self-heal daemon.
Fortunately, with the introduction of more granular changelogs in the replicate module to identify exactly which files under a given directory need to be healed to the sink brick, these problems should go away.
In fact this enhancement is being worked on as we speak and is targeted to be out by 3.8. Here is some doc: http://review.gluster.org/#/c/12257/1/in_progress/afr-self-heal-improvements.md (read section "Granular entry self-heals").
- Is it worth experimenting with different shard sizes?
Sure! You could use 'gluster volume set <VOL> features.shard-block-size <size>' to reconfigure the shard size. The new size will be used to shard those files/images/vdisks that are created _after_ the block size was reconfigured.
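For example, with the volume from your setup:

  gluster volume set datastore features.shard-block-size 512MB

Note that existing images keep the block size they were created with, so to
try 512M on the same image you would need to recreate it (for example by
copying it off the volume and back again).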
- Anything you'd like me to test?
Yes. So Paul Cuzner and Satheesaran who have been testing sharding here have reported better write performance with 512M shards. I'd be interested to know what you feel about performance with relatively larger shards (think 512M).

-Krutika
Thanks,
--
Lindsay
Lindsay Mathieson
2015-10-27 11:47:09 UTC
Post by Krutika Dhananjay
Hi Lindsay,
Thank you for trying out sharding and for your feedback. :) Please find my comments inline.
Hi Krutika, thanks for the feedback.
Post by Krutika Dhananjay
With a block size as low as 4MB, to the replicate module these individual
shards appear as a large number of small(er) files, effectively turning it
into some form of a small-file workload.
There is an enhancement being worked on in AFR by Pranith which attempts
to improve write performance and will be especially useful when used with
sharding. That should make this problem go away.
Cool, also for my purposes (VM Image hosting), block sizes of 512MB are
just as good and improve things considerably.
Post by Krutika Dhananjay
After heals completed I shut down the VMs and ran an md5sum on the VM
image (via glusterfs) on each node. They all matched except for one time
on gn3. Once I unmounted/remounted the datastore on gn3 the md5sum matched.
This could possibly be the effect of a caching bug reported at
https://bugzilla.redhat.com/show_bug.cgi?id=1272986. The fix is out for
review and I'm confident that it will make it into 3.7.6.
Cool, I can replicate it fairly reliably at the moment.

Would it occur when using qemu/gfapi direct?
Post by Krutika Dhananjay
gluster volume heal datastore info *always* shows a split-brain on the
directory, but it always heals without intervention. Dunno if this is
normal or not.
Which directory would this be?
Oddly it was the .shard directory
Post by Krutika Dhananjay
Do you have the glustershd logs?
Sorry no, and I haven't managed to replicate it again. Will keep trying.
Post by Krutika Dhananjay
https://gluster.readthedocs.org/en/release-3.7.0/Features/shard/. Let me
know if you have more questions, and I will be happy to answer them.
The problems we foresaw with too many 4MB shards are that
i. entry self-heal under /.shard could result in a complete crawl of the
/.shard directory during heal, or
ii. a disk replacement could involve a large number of files needing to be
created and healed to the sink brick,
both of which would result in slower "entry" heal and rather high resource
consumption from the self-heal daemon.
Thanks, most interesting reading.
Post by Krutika Dhananjay
Fortunately, with the introduction of more granular changelogs in the
replicate module to identify exactly which files under a given directory
need to be healed to the sink brick, these problems should go away.
In fact this enhancement is being worked on as we speak and is targeted
to be out by 3.8. Here is some doc:
http://review.gluster.org/#/c/12257/1/in_progress/afr-self-heal-improvements.md
(read section "Granular entry self-heals").
That looks very interesting - in fact, from my point of view it removes the
need for sharding altogether, that need being the speed of heals.
Post by Krutika Dhananjay
Yes. So Paul Cuzner and Satheesaran who have been testing sharding here
have reported better write performance with 512M shards. I'd be interested
to know what you feel about performance with relatively larger shards
(think 512M).
Seq Read speeds basically tripled, and seq writes improved to the limit of
the network connection.

Cheers,
--
Lindsay
Krutika Dhananjay
2015-10-28 07:03:43 UTC
----- Original Message -----
Sent: Tuesday, October 27, 2015 5:17:09 PM
Subject: Re: [Gluster-users] Shard Volume testing (3.7.5)
Post by Krutika Dhananjay
Hi Lindsay,
Thank you for trying out sharding and for your feedback. :) Please find my
comments inline.
Hi Krutika, thanks for the feedback.
Post by Krutika Dhananjay
With a block size as low as 4MB, to the replicate module these individual
shards appear as a large number of small(er) files, effectively turning it
into some form of a small-file workload.
There is an enhancement being worked on in AFR by Pranith which attempts
to improve write performance and will be especially useful when used with
sharding. That should make this problem go away.
Cool, also for my purposes (VM Image hosting), block sizes of 512MB are just
as good and improve things considerably.
Post by Krutika Dhananjay
Post by Lindsay Mathieson
After heals completed I shut down the VMs and ran an md5sum on the VM image
(via glusterfs) on each node. They all matched except for one time on gn3.
Once I unmounted/remounted the datastore on gn3 the md5sum matched.
This could possibly be the effect of a caching bug reported at
https://bugzilla.redhat.com/show_bug.cgi?id=1272986 . The fix is out for
review and I'm confident that it will make it into 3.7.6.
Cool, I can replicate it fairly reliably at the moment.
Would it occur when using qemu/gfapi direct?
Post by Krutika Dhananjay
Post by Lindsay Mathieson
gluster volume heal datastore info *always* shows a split-brain on the
directory, but it always heals without intervention. Dunno if this is normal
or not.
Which directory would this be?
Oddly it was the .shard directory
Post by Krutika Dhananjay
Do you have the glustershd logs?
Sorry no, and I haven't managed to replicate it again. Will keep trying.
Post by Krutika Dhananjay
https://gluster.readthedocs.org/en/release-3.7.0/Features/shard/ . Let me
know if you have more questions, and I will be happy to answer them.
The problems we foresaw with too many 4MB shards are that
i. entry self-heal under /.shard could result in a complete crawl of the
/.shard directory during heal, or
ii. a disk replacement could involve a large number of files needing to be
created and healed to the sink brick,
both of which would result in slower "entry" heal and rather high resource
consumption from the self-heal daemon.
Thanks, most interesting reading.
Post by Krutika Dhananjay
Fortunately, with the introduction of more granular changelogs in the replicate
module to identify exactly which files under a given directory need to be
healed to the sink brick, these problems should go away.
In fact this enhancement is being worked on as we speak and is targeted
to be out by 3.8. Here is some doc:
http://review.gluster.org/#/c/12257/1/in_progress/afr-self-heal-improvements.md
(read section "Granular entry self-heals").
That looks very interesting - in fact, from my point of view it removes the
need for sharding altogether, that need being the speed of heals.
So sharding also helps with better disk utilization in distributed-replicated volumes for large files (like VM images).
So if you have a 2x3 volume with each brick having 10G of space (say), even though the aggregate size of the volume (due to the presence of distribute) is 20G, without sharding you cannot create an image whose size is between 11G and 20G on the volume.
With sharding, breaking large files into smaller pieces will ensure better utilisation of available space.
There are other long-term benefits one could reap from using sharding: for instance, for someone who might want to use tiering in VM store use-case, having sharding will be beneficial in terms of only migrating the shards between hot and cold tiers, as opposed to moving large files in full, even if only a small portion of the file is changed/accessed. :)
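If you are curious, you can see this on the bricks themselves: the shards
under /.shard are placed by distribute just like ordinary files, so the
pieces of a single large image end up spread across both replica sets.
Something along these lines, with the brick path and gfid as placeholders:

  # on each node, count how many shards of one image sit on the local brick;
  # across the two replica sets the counts add up to the image's total
  ls /bricks/brick1/datastore/.shard | grep -c '^<gfid-of-image>\.'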
Post by Krutika Dhananjay
Yes. So Paul Cuzner and Satheesaran who have been testing sharding here
have
reported better write performance with 512M shards. I'd be interested to
know what you feel about performance with relatively larger shards (think
512M).
Seq Read speeds basically tripled, and seq writes improved to the limit of
the network connection.
OK. And what about the data heal performance with 512M shards? Satisfactory?

-Krutika
Cheers,
--
Lindsay
Lindsay Mathieson
2015-10-28 07:38:33 UTC
Post by Krutika Dhananjay
So sharding also helps with better disk utilization in
distributed-replicated volumes for large files (like VM images).
..
There are other long-term benefits one could reap from using sharding: for
instance, for someone who might want to use tiering in VM store use-case,
having sharding will be beneficial in terms of only migrating the shards
between hot and cold tiers, as opposed to moving large files in full, even
if only a small portion of the file is changed/accessed. :)
Interesting points, thanks.
Post by Krutika Dhananjay
Post by Krutika Dhananjay
Yes. So Paul Cuzner and Satheesaran who have been testing sharding here
have reported better write performance with 512M shards. I'd be interested
to know what you feel about performance with relatively larger shards
(think 512M).
Seq Read speeds basically tripled, and seq writes improved to the limit of
the network connection.
OK. And what about the data heal performance with 512M shards?
Satisfactory?
Easily satisfactory, a bit slower than with 4MB shards but still way faster
than a full multi-GB file heal :)


Something I have noticed is that heal info (gluster volume heal
<datastore> info) can be very slow to return, as in many tens of seconds -
is there a way to speed that up?

It would be very useful if there was a command that quickly gave
summary/progress status, e.g. "There are <X> shards to be healed"
--
Lindsay
Krutika Dhananjay
2015-10-28 15:02:09 UTC
----- Original Message -----
Sent: Wednesday, October 28, 2015 1:08:33 PM
Subject: Re: [Gluster-users] Shard Volume testing (3.7.5)
Post by Krutika Dhananjay
So sharding also helps with better disk utilization in
distributed-replicated
volumes for large files (like VM images).
..
There are other long-term benefits one could reap from using sharding: for
instance, for someone who might want to use tiering in VM store use-case,
having sharding will be beneficial in terms of only migrating the shards
between hot and cold tiers, as opposed to moving large files in full, even
if only a small portion of the file is changed/accessed. :)
Interesting points, thanks.
Post by Krutika Dhananjay
Post by Lindsay Mathieson
Post by Krutika Dhananjay
Yes. So Paul Cuzner and Satheesaran who have been testing sharding here
have
reported better write performance with 512M shards. I'd be interested to
know what you feel about performance with relatively larger shards (think
512M).
Seq Read speeds basically tripled, and seq writes improved to the limit of
the network connection.
OK. And what about the data heal performance with 512M shards? Satisfactory?
Easily satisfactory, a bit slower than with 4MB shards but still way faster
than a full multi-GB file heal :)
Something I have noticed is that heal info (gluster volume heal
<datastore> info) can be very slow to return, as in many tens of seconds -
is there a way to speed that up?
With sharding? Or even otherwise? Approximately how many entries did the command list when you found it to be slow?
On a related note, Anuradha (cc'd) is working on an enhancement that would make the 'heal info' reporting faster. She should be able to tell you more about it.
It would be very useful if there was a command that quickly gave
summary/progress status, e.g. "There are <X> shards to be healed"
Hmmm ... that would have to be an extension of 'heal info' or perhaps post-processing of the 'heal info' output which would group the different shards of a given file that need heal together. Nice suggestion. I will think about it.
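Until then, a rough approximation you could script yourself - assuming the
default 'heal info' output format, where each pending entry is printed as a
path under the brick - would be something like:

  # count pending shard entries and group them by the base file's gfid
  gluster volume heal datastore info \
    | grep '/.shard/' \
    | awk -F/ '{ split($NF, a, "."); print a[1] }' \
    | sort | uniq -c | sort -rn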

-Krutika
--
Lindsay
Anuradha Talur
2015-10-29 05:50:07 UTC
----- Original Message -----
Sent: Wednesday, October 28, 2015 1:08:33 PM
Subject: Re: [Gluster-users] Shard Volume testing (3.7.5)
So sharding also helps with better disk utilization in distributed-replicated
volumes for large files (like VM images).
..
There are other long-term benefits one could reap from using sharding: for
instance, for someone who might want to use tiering in VM store use-case,
having sharding will be beneficial in terms of only migrating the shards
between hot and cold tiers, as opposed to moving large files in full, even
if only a small portion of the file is changed/accessed. :)
Interesting points, thanks.
Yes. So Paul Cuzner and Satheesaran who have been testing sharding here have
reported better write performance with 512M shards. I'd be interested to
know what you feel about performance with relatively larger shards (think
512M).
Seq Read speeds basically tripled, and seq writes improved to the limit of
the network connection.
OK. And what about the data heal performance with 512M shards? Satisfactory?
Easily satisfactory, a bit slower than with 4MB shards but still way faster
than a full multi-GB file heal :)
Something I have noticed is that heal info (gluster volume heal
<datastore> info) can be very slow to return, as in many tens of seconds -
is there a way to speed that up?
Yes, there is a way to speed it up. Basically the process of finding out
whether a file needs heal or not takes some time, leading to slow heal info.
This decision making can be done in a faster way. I'm working on the approach
and will send a patch in the coming days.
It would be very useful if there was a command that quickly gave
summary/progress status, e.g. "There are <X> shards to be healed"
--
Lindsay
--
Thanks,
Anuradha.
Lindsay Mathieson
2015-10-29 06:38:38 UTC
Post by Anuradha Talur
Yes, there is a way to speed it up. Basically the process of finding out
whether a file needs heal or not takes some time, leading to slow heal info.
This decision making can be done in a faster way. I'm working on the approach
and will send a patch in the coming days.
Thanks, looking forward to it.
--
Lindsay