Discussion:
Unnecessary healing in 3-node replication setup on reboot
Udo Giacomozzi
2015-10-15 07:26:45 UTC
Hello everybody,

I'm new to this list, apologies if I'm asking something stupid.. ;-)

I'm using GlusterFS on three nodes as the foundation for a 3-node
high-availability Proxmox cluster. GlusterFS is mostly used to store the
HDD images of a number of VMs and is accessed via NFS.

My problem is that every time I reboot one of the nodes, Gluster starts
healing all of the files. Since they are quite big, it takes
~15-30 minutes to complete. The healing completes successfully, but I have to be
extremely careful not to migrate VMs around in the meantime because that results in
corrupted files.

I've already posted this problem in the #gluster IRC channel:
http://irclog.perlgeek.de/gluster/2015-10-01#i_11302365
and apparently it is a bug that *could* have been resolved in more
recent releases of Gluster.

I'm currently running the most recent version from the Proxmox 3.4
repository (Gluster 3.5.2; based on Debian Wheezy). Upgrading Gluster
means some work (building from source, probably) and potential risk, so I'd
like to be sure that Gluster 3.7 will solve this problem without
causing any other problems.

Does anybody have more detailed information about this bug? Is there
perhaps a way to work around it?

Thank you very much,
Udo
Lindsay Mathieson
2015-10-15 08:16:54 UTC
The gluster.org Debian Wheezy repo installs 3.6.6 safely on Proxmox 3.4; I use it myself.
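
Roughly what I did to add it (repo layout from memory - please verify the
actual paths and the signing key on download.gluster.org before copying this):

  # assumed layout of the gluster.org repo for 3.6 on wheezy - verify first
  wget -O - http://download.gluster.org/pub/gluster/glusterfs/3.6/LATEST/Debian/pubkey.gpg | apt-key add -
  echo "deb http://download.gluster.org/pub/gluster/glusterfs/3.6/LATEST/Debian/wheezy/apt wheezy main" \
      > /etc/apt/sources.list.d/gluster.list
  apt-get update && apt-get install glusterfs-server glusterfs-client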

Lindsay Mathieson

Lindsay Mathieson
2015-10-16 00:25:21 UTC
Post by Udo Giacomozzi
My problem is that every time I reboot one of the nodes, Gluster starts
healing all of the files. Since they are quite big, it takes ~15-30
minutes to complete. The healing completes successfully, but I have to be extremely
careful not to migrate VMs around in the meantime because that results in corrupted files.
Sorry, meant to ask this earlier - when rebooting one node in a replica 3
gluster, any files written to while the node is rebooting will need to
be healed. Given your files are running VM images, that will be all of them.
So healing all the files sounds like the correct behaviour.
--
Lindsay
Udo Giacomozzi
2015-10-16 09:43:15 UTC
Post by Udo Giacomozzi
My problem is that every time I reboot one of the nodes, Gluster
starts healing all of the files. Since they are quite big, it
takes ~15-30 minutes to complete. The healing completes successfully,
but I have to be extremely careful not to migrate VMs around
in the meantime because that results in corrupted files.
Sorry, meant to ask this earlier - when rebooting one node in a replica
3 gluster, any files written to while the node is rebooting will
need to be healed. Given your files are running VM images, that will be
all of them. So healing all the files sounds like the correct behaviour.
Hi Lindsay,

So, given the following situation:

* all VMs are running on node #1 or #2
* *no* VMs are running on node #3, so *no Gluster files touched there*
* node #3 reboots

In such a situation, would it be normal for all Gluster files to
be healed afterwards? Given the time it takes and the network load
measured, it apparently does *not* do a simple metadata check, but rather
seems to transfer the *contents* of all the files across the network.

Is that normal behavior?

Udo
Udo Giacomozzi
2015-10-16 14:26:34 UTC
That looks correct.
During the reboot, if the VMs write anything, then at the end the files on
#1 and #2 will be different from those on #3, which was down. So healing
is NECESSARY.
Ivan
Ok, I see. :-/

To me this sounds like Gluster is not really suited for big files, e.g.
as the main storage for VMs - since they are being modified constantly.
Or am I missing something? Perhaps Gluster can be configured to heal
only modified parts of the files?

Thanks,
Udo
Lindsay Mathieson
2015-10-16 14:41:19 UTC
To me this sounds like Gluster is not really suited for big files, e.g. as
the main storage for VMs - since they are being modified constantly.
Depends :)

Any replicated storage will have to heal its copies if they are written to
when a node is down. So long as the files can still be read/written while
being healed and the resource usage (CPU/Network) is not too high, then it
should be transparent - that's the whole point of a replicated
filesystem.

I'm guessing that, like me, you are running your gluster storage on your VM
hosts and, also like me, are a chronic tweaker, so you tend to reboot the hosts
more than you should. In that case you might want to consider moving your
gluster storage to separate dedicated nodes that you can leave up.
Or am I missing something? Perhaps Gluster can be configured to heal only
modified parts of the files?
Not that I know of.

Ceph is pretty good at tracking changes and only transferring them - heals
from a reboot generally only take a few minutes on my three-node setup. But
it is a huge headache to set up and administer, and its I/O performance is
pretty bad on small setups (< 6 nodes, < 24 disks). It scales really
well and really shines when you get into the hundreds of nodes and disks,
but I would not recommend it for small IT setups.
--
Lindsay
Vijay Bellur
2015-10-16 16:51:40 UTC
Post by Udo Giacomozzi
To me this sounds like Gluster is not really suited for big files,
e.g. as the main storage for VMs - since they are being modified
constantly.
Depends :)
Any replicated storage will have to heal its copies if they are written
to when a node is down. So long as the files can still be read/written
while being healed and the resource usage (CPU/Network) is not too high,
then it should be transparent - that's the whole point of a
replicated filesystem.
I'm guessing that, like me, you are running your gluster storage on your
VM hosts and, also like me, are a chronic tweaker, so you tend to reboot the
hosts more than you should. In that case you might want to consider
moving your gluster storage to separate dedicated nodes that you can
leave up.
Or am I missing something? Perhaps Gluster can be configured to heal
only modified parts of the files?
Not that I know of.
self-healing in gluster by default syncs only modified parts of the
files from a source node. Gluster does a rolling checksum of a file
needing self-heal to identify regions of the file which need to be
synced over the network. This rolling checksum computation can sometimes
be expensive and there are plans to have a lighter self-healing in 3.8
with more granular changelogs that can do away with the need to do a
rolling checksum.

You may also want to check sharding (currently in beta with 3.7) where
large files are chunked to smaller fragments. With this scheme,
self-healing (and rolling checksum computation thereby) happens only on
those fragments that undergo changes when one of the nodes in a
replicated set is offline. This has shown nice improvements in gluster's
resource utilization during self-healing.

Regards,
Vijay
Lindsay Mathieson
2015-10-16 22:17:06 UTC
Post by Vijay Bellur
You may also want to check sharding (currently in beta with 3.7) where
large files are chunked to smaller fragments. With this scheme,
self-healing (and rolling checksum computation thereby) happens only on
those fragments that undergo changes when one of the nodes in a replicated
set is offline. This has shown nice improvements in gluster's resource
utilization during self-healing.
Very interesting, I presume you'd have to create a new volume to test it.

Also you'd lose the ability to access the file on the host filesystem in
emergencies, wouldn't you?
--
Lindsay
Vijay Bellur
2015-10-17 14:02:29 UTC
Post by Vijay Bellur
You may also want to check sharding (currently in beta with 3.7)
where large files are chunked to smaller fragments. With this
scheme, self-healing (and rolling checksum computation thereby)
happens only on those fragments that undergo changes when one of the
nodes in a replicated set is offline. This has shown nice
improvements in gluster's resource utilization during self-healing.
Very interesting, I presume you'd have to create a new volume to test it.
Also you'd lose the ability to access the file on the host filesystem
in emergencies, wouldn't you?
Right on both counts. If you are aware of the layout, the shards can be
concatenated to get back a single file. It does need some work to locate
the shards and we can possibly provide a script that can stitch shards
back to a single file.
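
As a rough illustration only (not a supported tool - the brick path, file
name and GFID below are placeholders): the first block stays at the file's
own path on the brick, and the remaining pieces are stored on the bricks
under a hidden .shard directory as <GFID>.1, <GFID>.2 and so on, so
something along these lines could reassemble a file:

  BRICK=/path/to/brick                       # one brick of the volume
  FILE=path/to/large-file.img                # hypothetical file name
  GFID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  # from "getfattr -n trusted.gfid" on the brick copy
  cp "$BRICK/$FILE" /tmp/restored.img        # the base file holds the first block
  i=1
  while [ -e "$BRICK/.shard/$GFID.$i" ]; do  # append the remaining blocks in order
      cat "$BRICK/.shard/$GFID.$i" >> /tmp/restored.img
      i=$((i + 1))
  done
  # caveat: shards for sparse (never-written) regions may not exist at all,
  # so a proper script would seek to each shard's offset instead of stopping here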

Regards,
Vijay
Lindsay Mathieson
2015-10-16 22:45:14 UTC
Post by Vijay Bellur
You may also want to check sharding (currently in beta with 3.7) where
large files are chunked to smaller fragments. With this scheme,
self-healing (and rolling checksum computation thereby) happens only on
those fragments that undergo changes when one of the nodes in a replicated
set is offline. This has shown nice improvements in gluster's resource
utilization during self-healing.
Does it affect read speed and random I/O? I guess that would depend on the
method used to calculate the shard location for a given block. Could be
quite interesting on top of ZFS - I'd love to test it.
--
Lindsay
Vijay Bellur
2015-10-17 14:17:18 UTC
Post by Vijay Bellur
You may also want to check sharding (currently in beta with 3.7)
where large files are chunked to smaller fragments. With this
scheme, self-healing (and rolling checksum computation thereby)
happens only on those fragments that undergo changes when one of the
nodes in a replicated set is offline. This has shown nice
improvements in gluster's resource utilization during self-healing.
Does it affect read speed and random I/O? I guess that would depend on
the method used to calculate the shard location for a given block.
Could be quite interesting on top of ZFS - I'd love to test it.
Krutika has been working on several performance improvements for
sharding and the results have been encouraging for virtual machine
workloads.

Testing feedback would be very welcome!

Thanks,
Vijay
Lindsay Mathieson
2015-10-17 14:44:26 UTC
Krutika has been working on several performance improvements for sharding
and the results have been encouraging for virtual machine workloads.
Testing feedback would be very welcome!
Got to upgrade my cluster to Jessie first :( Non-trivial.

Is sharding a definite feature now?
--
Lindsay
Vijay Bellur
2015-10-18 05:16:38 UTC
Post by Vijay Bellur
Krutika has been working on several performance improvements for
sharding and the results have been encouraging for virtual machine
workloads.
Testing feedback would be very welcome!
Got to upgrade my cluster to Jessie first :( Non-trivial.
Is sharding a definite feature now?
Yes, we will be looking to have it out of beta for virtual machine image
storage.

Regards,
Vijay

Udo Giacomozzi
2015-10-17 15:38:17 UTC
Post by Vijay Bellur
self-healing in gluster by default syncs only modified parts of the
files from a source node. Gluster does a rolling checksum of a file
needing self-heal to identify regions of the file which need to be
synced over the network. This rolling checksum computation can
sometimes be expensive and there are plans to have a lighter
self-healing in 3.8 with more granular changelogs that can do away
with the need to do a rolling checksum.
I did some tests (see below) - could you please check this and tell me
if this is normal?


For example, I have a 200GB VM disk image in my volume (the biggest
file). About 75% of that disk is currently unused space and writes are
only about 50 kbytes/sec.
That 200GB disk image *always* "heals" for a very long time (at least 30
minutes) - even though I'm pretty sure only a few blocks can have
changed.


Anyway, I just rebooted a node (about 2-3 minutes downtime) to collect
some information:

  * In total I have about 790 GB* of files in that Gluster volume
  * about 411 GB* belong to active VM HDD images; the rest are
    backup/template files
  * only VM HDD images are being healed (max 15 files)
  * while healing, glusterfsd shows varying CPU usage between 70% and
    650% (it's a 16-core server); 106 minutes of total CPU time once
    healing completed
  * once healing completed, the machine had received a total of 7.0 GB and
    sent 3.6 GB over the internal network (so, yes, you're right that
    not all contents are transferred)
  * *total heal time: a whopping 58 minutes*

* these are summed up file sizes; "du" and "df" commands show smaller usage

Node details (all 3 nodes are identical):

* DELL PowerEdge R730
* Intel Xeon E5-2600 @ 2.4GHz
* 64 GB DDR4 RAM
* the server is able to gzip-compress about 1 GB data / second (all
cores together)
* 3 TB HW-RAID10 HDD (2.7TB reserved for Gluster); minimum 500 MB/s
write speed, 350 MB/s read speed
* redundant 1GBit/s internal network
* Debian 7 Wheezy / Proxmox 3.4, Kernel 2.6.32, Gluster 3.5.2

Volume setup:

# gluster volume info systems

Volume Name: systems
Type: Replicate
Volume ID: b2d72784-4b0e-4f7b-b858-4ec59979a064
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: metal1:/data/gluster/systems
Brick2: metal2:/data/gluster/systems
Brick3: metal3:/data/gluster/systems
Options Reconfigured:
cluster.server-quorum-ratio: 51%

Note that `gluster volume heal "systems" info` takes 3-10 seconds
to complete during heal - I hope that doesn't slow down healing since I
tend to run that command frequently.
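
As a side note, I may switch to the statistics variant for polling, if
3.5.2 supports it - as far as I understand it only prints per-brick counts
of entries pending heal instead of listing every file:

  # prints only the number of entries pending heal per brick
  gluster volume heal systems statistics heal-count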


Would you expect these results or is something wrong?

Would upgrading to Gluster 3.6 or 3.7 improve healing performance?

Thanks,
Udo
Lindsay Mathieson
2015-10-16 14:27:56 UTC
In such a situation, would it be normal for all Gluster files to be
healed afterwards? Given the time it takes and the network load measured, it
apparently does *not* do a simple metadata check, but rather seems to
transfer the *contents* of all the files across the network.
Is that normal behavior?
Any file that is written to while a brick is down has to be healed when the
brick is back, i.e. the missing/changed data has to be written to the brick. I
presume that when you took a node down you migrated your VMs to another
node, so all VMs would still be running.

Depending on your heal settings this can be a complete content transfer of
the file.

Someone feel free to correct me on this, but gluster doesn't track
which sectors are dirty, so it has two strategies it can employ:

- Diff: Compute and compare a checksum for each sector/block between separate
replicas (CPU intensive).

- Full: Just copy the entire file across (network intensive).
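
If I remember right, which strategy gets used is controlled by the
cluster.data-self-heal-algorithm volume option (values full / diff / reset -
please verify with `gluster volume set help` on your version):

  # <volname> is a placeholder for your volume name
  gluster volume set <volname> cluster.data-self-heal-algorithm diff   # checksum-based, CPU heavy
  gluster volume set <volname> cluster.data-self-heal-algorithm full   # whole-file copy, network heavy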
--
Lindsay