Discussion: Sharding - what next?
Krutika Dhananjay
2015-12-03 10:34:35 UTC
Hi,

When we designed and wrote the sharding feature in GlusterFS, our focus was on
single-writer-to-large-file use cases, chief among these being the virtual machine image store use case.
Sharding, for the uninitiated, is a feature that was introduced in the glusterfs-3.7.0 release with 'experimental' status.
Here is some documentation that explains what it does at a high level:
http://www.gluster.org/community/documentation/index.php/Features/sharding-xlator
https://gluster.readthedocs.org/en/release-3.7.0/Features/shard/
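For a quick illustration, enabling it is just a couple of volume options. The option names below follow the 3.7 docs above, and the volume name and block size are only examples, so treat this as a sketch rather than the canonical procedure:

# enable the shard translator on an existing volume (example volume name)
gluster volume set myvol features.shard on
# optionally set the shard block size (512MB here is just an illustrative value)
gluster volume set myvol features.shard-block-size 512MB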

We have now reached the stage where the feature is considered stable for the VM store use case
after several rounds of testing (thanks to Lindsay Mathieson, Paul Cuzner and Satheesaran Sundaramoorthi),
bug fixing and reviews (thanks to Pranith Karampuri). In addition, patches have been sent to make
sharding work with geo-replication, thanks to Kotresh's efforts (testing is still in progress).

We would love to hear from you on what you think of the feature and where it could be improved.
Specifically, the following are the questions we are seeking feedback on:
a) your experience testing sharding with VM store use-case - any bugs you ran into, any performance issues, etc
b) other large-file use-cases (apart from the VM store workload) you know of or use,
where you think having sharding capability would be useful.

Based on your feedback we will start work on making sharding work in other workloads and/or with other existing GlusterFS features.

Thanks,
Krutika
Lindsay Mathieson
2015-12-09 13:18:40 UTC
Hi Guys, sorry for the late reply, my attention tends to be somewhat
sporadic due to work and the large number of rescue dogs/cats I care for :)
Post by Krutika Dhananjay
We would love to hear from you on what you think of the feature and
where it could be improved.
a) your experience testing sharding with VM store use-case - any bugs
you ran into, any performance issues, etc
Testing was initially somewhat stressful as I regularly encountered file
corruption. However, I don't think that was due to bugs, but rather to
incorrect settings for the VM use case. Once I got that sorted out it has
been very stable - I have really stressed the failure modes we run into at
work: nodes going down while heavy writes were happening, live migrations
during heals, gluster software being killed while VMs were running on the
host. So far it's held up without a hitch.

To that end, one thing I think should be made more obvious is the
settings required for VM Hosting:

quick-read=off
read-ahead=off
io-cache=off
stat-prefetch=off
eager-lock=enable
remote-dio=enable
quorum-type=auto
server-quorum-type=server

They are quite crucial and very easy to miss in the online docs. And
they are listed only as recommendations, with no mention that you will
corrupt KVM VMs if you live-migrate them between gluster nodes without
them set. Also the virt group is missing from the Debian packages.
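
For anyone applying them by hand, the short names above map onto the usual
gluster option namespaces. This is only a sketch from my own setup
(datastore1 is just my volume name), so double-check the option names
against your version's docs:

# apply the VM-store settings one option at a time
gluster volume set datastore1 performance.quick-read off
gluster volume set datastore1 performance.read-ahead off
gluster volume set datastore1 performance.io-cache off
gluster volume set datastore1 performance.stat-prefetch off
gluster volume set datastore1 cluster.eager-lock enable
gluster volume set datastore1 network.remote-dio enable
gluster volume set datastore1 cluster.quorum-type auto
gluster volume set datastore1 cluster.server-quorum-type server

Where the packages do ship the virt group file (/var/lib/glusterd/groups/virt),
'gluster volume set datastore1 group virt' should apply the whole set in one go.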

Setting them does seem to have slowed sequential writes by about 10%, but
I need to test that more.


Something related - sharding is useful because it makes heals much more
granular and hence faster. To that end it would be really useful if
there was a heal info variant that gave an overview of the process -
rather than listing the shards that are being healed, just an aggregate
total, e.g.

$ gluster volume heal datastore1 status
volume datastore1
- split brain: 0
- Wounded:65
- healing:4

It gives one an easy sense of progress - heals aren't happening any faster,
but it would feel that way :)


Also, it would be great if the heal info command could return faster;
sometimes it takes over a minute.

Thanks for the great work,

Lindsay
Krutika Dhananjay
2015-12-10 07:33:34 UTC
----- Original Message -----
Sent: Wednesday, December 9, 2015 6:48:40 PM
Subject: Re: Sharding - what next?
Hi Guys, sorry for the late reply, my attention tends to be somewhat sporadic
due to work and the large number of rescue dogs/cats I care for :)
Post by Krutika Dhananjay
We would love to hear from you on what you think of the feature and where
it could be improved.
a) your experience testing sharding with VM store use-case - any bugs you
ran into, any performance issues, etc
Testing was initially somewhat stressful as I regularly encountered file
corruption. However, I don't think that was due to bugs, but rather to
incorrect settings for the VM use case. Once I got that sorted out it has
been very stable - I have really stressed the failure modes we run into at
work: nodes going down while heavy writes were happening, live migrations
during heals, gluster software being killed while VMs were running on the
host. So far it's held up without a hitch.
To that end, one thing I think should be made more obvious is the settings
required for VM Hosting:
quick-read=off
read-ahead=off
io-cache=off
stat-prefetch=off
eager-lock=enable
remote-dio=enable
quorum-type=auto
server-quorum-type=server
They are quite crucial and very easy to miss in the online docs. And they are
listed only as recommendations, with no mention that you will corrupt KVM VMs
if you live-migrate them between gluster nodes without them set. Also the virt
group is missing from the Debian packages.
Hi Lindsay,
Thanks for the feedback. I will get in touch with Humble to find out what can be done about the docs.
Setting them does seem to have slowed sequential writes by about 10% but I
need to test that more.
Something related - sharding is useful because it makes heals much more
granular and hence faster. To that end it would be really useful if there
was a heal info variant that gave an overview of the process - rather than
listing the shards that are being healed, just an aggregate total, e.g.
$ gluster volume heal datastore1 status
volume datastore1
- split brain: 0
- Wounded:65
- healing:4
It gives one an easy sense of progress - heals aren't happening any faster, but
it would feel that way :)
There is a 'heal-info summary' command that is under review, written by Mohammed Ashiq @ http://review.gluster.org/#/c/12154/3 which prints the number of files that are yet to be healed.
It could perhaps be enhanced to print files in split-brain and also files which are possibly being healed. Note that these counts are printed per brick.
It does not print a single list of counts with aggregated values. Would that be something you would consider useful?
Also, it would be great if the heal info command could return faster,
sometimes it takes over a minute.
Yeah, I think part of the problem could be the eager-lock feature, which causes the GlusterFS client process to not relinquish the network lock on the file soon enough, leaving the heal info utility blocked for a longer duration.
There is an enhancement Anuradha Talur is working on where heal-info would do away with taking locks altogether. Once that is in place, heal-info should return faster.

-Krutika
Thanks for the great work,
Lindsay
Lindsay Mathieson
2015-12-16 01:26:03 UTC
Hi, late reply again ...
Post by Krutika Dhananjay
There is a 'heal-info summary' command that is under review, written
by Mohammed Ashiq, which prints the number of files that are yet to be healed.
It could perhaps be enhanced to print files in split-brain and also
files which are possibly being healed. Note that these counts are
printed per brick.
It does not print a single list of counts with aggregated values.
Would that be something you would consider useful?
Very much so, that would be perfect.

I can get close to this just with the following

gluster volume heal datastore1 info | grep 'Brick\|Number'


And if one is feeling fancy or just wants to keep an eye on progress

watch "gluster volume heal datastore1 info | grep 'Brick\|Number'"

though of course this runs afoul of the heal info delay.
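
If one just wants a rough cross-brick total, awk can sum the per-brick
counts - a sketch only, assuming the 'Number of entries:' lines keep that
format:

gluster volume heal datastore1 info | awk '/Number of entries/ {sum += $NF} END {print "entries needing heal:", sum}'

(An entry that needs healing on more than one brick gets counted once per
brick, so it overstates the total a little.)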
Post by Krutika Dhananjay
Also, it would be great if the heal info command could return
faster, sometimes it takes over a minute.
Yeah, I think part of the problem could be the eager-lock feature, which
causes the GlusterFS client process to not relinquish the network
lock on the file soon enough, leaving the heal info utility
blocked for a longer duration.
There is an enhancement Anuradha Talur is working on where heal-info
would do away with taking locks altogether. Once that is in place,
heal-info should return faster.
Excellent, I look forward to that. Even if removing the locks results in the
occasional inaccurate count, I don't think that would matter - from my
POV it's an indicator, not an absolute.

Thanks,
--
Lindsay Mathieson
Krutika Dhananjay
2015-12-16 12:59:29 UTC
----- Original Message -----
Sent: Wednesday, December 16, 2015 6:56:03 AM
Subject: Re: Sharding - what next?
Hi, late reply again ...
Post by Krutika Dhananjay
There is a 'heal-info summary' command that is under review, written by
Mohammed Ashiq, which prints the number of files that are yet to be healed.
It could perhaps be enhanced to print files in split-brain and also files
which are possibly being healed. Note that these counts are printed per
brick.
It does not print a single list of counts with aggregated values. Would
that be something you would consider useful?
Very much so, that would be perfect.
I can get close to this just with the following
gluster volume heal datastore1 info | grep 'Brick\|Number'
And if one is feeling fancy or just wants to keep an eye on progress
watch "gluster volume heal datastore1 info | grep 'Brick\|Number'"
though of course this runs afoul of the heal info delay.
I guess I did not make myself clear. Apologies. I meant to say that printing a single list of counts aggregated
from all bricks can be tricky and is susceptible to the same entry getting counted multiple times
if the inode needs a heal on multiple bricks. Eliminating such duplicates would be rather difficult.

Or, we could have a sub-command of heal-info that dumps all the file paths/gfids that need heal from all bricks, and
you could pipe the output to 'sort | uniq | wc -l' to eliminate duplicates. Would that be OK? :)
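
In the meantime, something like this against today's heal info output would
give a rough de-duplicated count - an untested sketch that just drops the
Brick/Status/Number header lines and blank lines before de-duplicating
whatever entries remain:

gluster volume heal datastore1 info | grep -v -e '^Brick' -e '^Status' -e '^Number' -e '^$' | sort -u | wc -l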

-Krutika
Post by Krutika Dhananjay
Post by Lindsay Mathieson
Also, it would be great if the heal info command could return faster,
sometimes it takes over a minute.
Yeah, I think part of the problem could be the eager-lock feature, which
causes the GlusterFS client process to not relinquish the network lock on
the file soon enough, leaving the heal info utility blocked for a longer
duration.
There is an enhancement Anuradha Talur is working on where heal-info would
do away with taking locks altogether. Once that is in place, heal-info should
return faster.
Excellent, I look forward to that. Even if removing the locks results in the
occasional inaccurate count, I don't think that would matter - from my POV
it's an indicator, not an absolute.
Thanks,
--
Lindsay Mathieson
Lindsay Mathieson
2015-12-16 23:54:31 UTC
Post by Krutika Dhananjay
I guess I did not make myself clear. Apologies. I meant to say that
printing a single list of counts aggregated
from all bricks can be tricky and is susceptible to the same entry
getting counted multiple times
if the inode needs a heal on multiple bricks. Eliminating such
duplicates would be rather difficult.
Or, we could have a sub-command of heal-info that dumps all the file
paths/gfids that need heal from all bricks, and
you could pipe the output to 'sort | uniq | wc -l' to eliminate
duplicates. Would that be OK? :)
Sorry, my fault - I did understand that. Aggregate counts per brick
would be fine; I have no desire to complicate things for the devs :)
--
Lindsay Mathieson