Discussion:
Freezing during heal
Kevin Lemonnier
2016-04-15 19:27:43 UTC
Hi,

We have a small GlusterFS 3.7.6 cluster with 3 nodes running Proxmox VMs on it. I did set up the recommended options (the virt group), but
by hand since it's on Debian. The shards are 256MB, if that matters.
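For reference, applying them by hand looks roughly like this, one gluster volume set per option, matching the "Options Reconfigured" list below:

  gluster volume set vm-storage cluster.quorum-type auto
  gluster volume set vm-storage cluster.server-quorum-type server
  gluster volume set vm-storage network.remote-dio enable
  gluster volume set vm-storage cluster.eager-lock enable
  gluster volume set vm-storage performance.quick-read off
  gluster volume set vm-storage performance.read-ahead off
  gluster volume set vm-storage performance.io-cache off
  gluster volume set vm-storage performance.stat-prefetch off
  gluster volume set vm-storage features.shard on
  gluster volume set vm-storage features.shard-block-size 256MB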

This morning the second node crashed, and as it came back up it started a heal, which basically froze all the VMs running on that volume. Since we really
can't have 40 minutes of downtime in the middle of the day, I just removed the node from the network; that stopped the heal and let the VMs access
their disks again. The plan was to reconnect the node in a couple of hours and let it heal at night.
But a VM just crashed, and it can't boot up again: it seems to freeze trying to access the disks.

Looking at the heal info for the volume, the count has gone way up since this morning; it looks like the VMs aren't writing to both remaining nodes, just the one they are on.
It seems pretty bad: we have 2 nodes out of 3 up, so I would expect the volume to keep working since it has quorum. What am I missing?
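For reference, the pending-heal list I'm looking at comes from the usual query (sketch, with the volume name from the info below):

  gluster volume heal vm-storage info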

It is still too early to start the heal, so is there a way to start the VM anyway right now? I mean, it was running a moment ago, so the data is there; the volume just needs
to let the VM access it.



Volume Name: vm-storage
Type: Replicate
Volume ID: a5b19324-f032-4136-aaac-5e9a4c88aaef
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: first_node:/mnt/vg1-storage
Brick2: second_node:/mnt/vg1-storage
Brick3: third_node:/mnt/vg1-storage
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 256MB
cluster.server-quorum-ratio: 51%


Thanks for your help
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
Lindsay Mathieson
2016-04-15 21:46:03 UTC
Post by Kevin Lemonnier
Looking at the heal info for the volume, the count has gone way up since this morning; it looks like the VMs aren't writing to both remaining nodes, just the one they are on.
It seems pretty bad: we have 2 nodes out of 3 up, so I would expect the volume to keep working since it has quorum. What am I missing?
I'm testing the exact same setup myself (Proxmox, shards, 3 nodes).


Could you post your gluster volume status output as well?

Are you monitoring with an ongoing "gluster volume heal <DS> info"?
I ask because I've encountered an issue where "heal info" will freeze I/O on
the cluster, giving the symptoms you describe. Kill off all "gluster
volume heal info" processes on all nodes, e.g. as sketched below.
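Something like this on each node should do it (a sketch; adjust the pattern to whatever ps shows on your systems):

  pgrep -af 'gluster volume heal'    # list any running heal-info queries
  pkill -f 'gluster volume heal'     # kill them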

Also if you can:
- shut down all VMs,
- stop the volume (gluster volume stop <DS>),
- stop the gluster service,
- unmount all gluster mounts,
- check there are no more gluster processes,

- then bring everything back up (rough sketch of the sequence below).
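Roughly, on Debian (service name from the glusterfs-server package; the mount point is wherever the volume is mounted, typically /mnt/pve/<storage-id> under Proxmox):

  gluster volume stop <DS>          # while glusterd is still up
  service glusterfs-server stop
  umount /mnt/pve/<storage-id>
  pgrep -af gluster                 # should print nothing
  service glusterfs-server start
  gluster volume start <DS>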
--
Lindsay Mathieson
Krutika Dhananjay
2016-04-17 15:56:37 UTC
Could you share the client logs and information about the approx time/day
when you saw this issue?

-Krutika
Kevin Lemonnier
2016-04-17 16:07:22 UTC
I believe Proxmox is just an interface to KVM that uses the library, so if I'm not mistaken there aren't client logs?

It's not the first time I've had the issue; it happens on every heal on the 2 clusters I have.

I did let the heal finish that night and the VMs are working now, but it is pretty scary for future crashes or brick replacements.
Should I maybe lower the shard size? It won't solve the fact that 2 bricks out of 3 aren't keeping the filesystem usable, but it might make the healing quicker, right?

Thanks
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Krutika Dhananjay
2016-04-18 06:58:28 UTC
Sorry, I was referring to the glusterfs client logs.

Assuming you are using a FUSE mount, your log file will be in
/var/log/glusterfs/<hyphenated-mount-point-path>.log
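For example, a volume mounted at /mnt/pve/vm-storage (a hypothetical Proxmox mount point) would log to /var/log/glusterfs/mnt-pve-vm-storage.log.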

-Krutika
Kevin Lemonnier
2016-04-18 07:10:16 UTC
Yes, but as I was saying, I don't believe KVM is using a mount point; I think it uses
the API (http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt).
I might be mistaken of course. Proxmox does have a mount point for convenience, so I'll attach those
logs, hoping they contain the information you need. They do seem to contain a lot of errors
for the 15th.
For reference, there was a disconnect of the first brick (10.10.0.1) in the morning and then a successful
heal that caused about 40 minutes of downtime for the VMs. Right after that heal finished (if my memory is
correct, it was about noon or close to it), the second brick (10.10.0.2) rebooted, and that's the one I disconnected
to prevent the heal from causing more downtime.
I reconnected it once at the end of the afternoon, hoping the heal would go well, but everything went down
like in the morning, so I disconnected it again and waited until 11pm (23:00) to reconnect it and let it finish.

Thanks for your help,
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
Krutika Dhananjay
2016-04-18 14:28:44 UTC
Hi,

Yeah, so the FUSE mount log didn't convey much information.

So one of the reasons the heal may have taken so long (and also consumed
so many resources) is a bug in self-heal where it would heal from
both source bricks in 3-way replication. With that bug, a heal takes
twice as long and consumes twice the resources.

This issue is fixed at http://review.gluster.org/#/c/14008/ and will be
available in 3.7.12.

The other thing you could do is set cluster.data-self-heal-algorithm to
'full', for better heal performance and more predictable resource consumption:
#gluster volume set <VOL> cluster.data-self-heal-algorithm full
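Once set, it shows up under "Options Reconfigured", so a quick check is something like:

  gluster volume info <VOL> | grep data-self-heal-algorithm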

As far as sharding is concerned, some critical caching issues were fixed in
3.7.7 and 3.7.8,
and my guess is that the VM crash/unbootable state is because of those
issues, which exist in 3.7.6.

3.7.10 introduced throttled client-side heals, which also moves
such heals to the background; this is all the more helpful for preventing
starvation of VMs during client heals.

Considering these factors, I think it would be better if you upgraded your
machines to 3.7.10.
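If it helps, a rough sketch of a rolling upgrade on Debian, one node at a time (package names as shipped on Debian; point apt at the matching 3.7.x release on download.gluster.org first, and let pending heals drain before moving to the next node):

  service glusterfs-server stop
  apt-get update && apt-get install glusterfs-server glusterfs-client
  service glusterfs-server start
  gluster volume heal <VOL> info    # wait until nothing is pending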

Do let me know if migrating to 3.7.10 solves your issues.

-Krutika
Kevin Lemonnier
2016-04-18 14:32:10 UTC
I will try migrating to 3.7.10; is it considered stable yet?

Should I change the self-heal algorithm even if I move to 3.7.10, or is that not necessary?
I'm not sure what that change might do.

Anyway, I'll try to create a 3.7.10 cluster this weekend and slowly move the VMs onto it.
Thanks a lot for your help,

Regards
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
Krutika Dhananjay
2016-04-18 14:47:05 UTC
Post by Kevin Lemonnier
I will try migrating to 3.7.10; is it considered stable yet?
Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)
Post by Kevin Lemonnier
Should I change the self-heal algorithm even if I move to 3.7.10, or is that not necessary?
I'm not sure what that change might do.
So the other algorithm is 'diff', which computes rolling checksums on chunks
of the source(s) and sink(s), compares them, and heals on mismatch. This is
known to consume a lot of CPU. The 'full' algorithm, on the other hand, simply copies
the source into the sink in chunks. With sharding, copying a 256MB file (in your
case) from source to sink shouldn't be all that bad. We've used double that
block size and had no issues reported.

So you could change self heal algo to full even in the upgraded cluster.

-Krutika
Lindsay Mathieson
2016-04-18 23:38:44 UTC
Post by Krutika Dhananjay
Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)
Wasn't the 3.7.10 regression only a problem on reboots if you used
gluster snapshots (which Proxmox doesn't)? I'm currently using 3.7.10,
no issues with restarts so far.
Post by Krutika Dhananjay
This issue is fixed at http://review.gluster.org/#/c/14008/ and will
be available in 3.7.12.
Rats, I thought it was in 3.7.11 :( But can't it also be worked around by
disabling "cluster.data-self-heal"? Or was that something else?


I'm getting pretty good heal performance with the following settings (applied as sketched after the list):

- features.shard-block-size: 64MB
- cluster.self-heal-window-size: 1024
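A sketch of setting those (note that shard-block-size only affects files created after the change, as far as I know):

  gluster volume set <DS> features.shard-block-size 64MB
  gluster volume set <DS> cluster.self-heal-window-size 1024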
--
Lindsay Mathieson
Kevin Lemonnier
2016-04-25 12:01:09 UTC
Hi,

So I'm trying that now.
I installed 3.7.11 on two nodes and put a few VMs on it, same config
as before but with 64MB shards and the heal algorithm set to full. As expected,
if I power off one of the two nodes, everything is dead, which is fine.

Now I'm adding a third node; a big heal of everything (7000+ shards) was started
after the add-brick, and for now everything seems to be working
fine on the VMs. Last time I tried adding a brick, all those VMs died for
the duration of the heal, so that's already pretty good.
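For reference, the add-brick was the usual bump from replica 2 to replica 3, something like this (brick path as in the volume info below):

  gluster volume add-brick gluster replica 3 ipvr50.client_name:/mnt/storage/gluster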

I'm going to let it finish copying everything onto the new node, then I'll try
to simulate nodes going down to see if my original problem of freezing and
slow heals is solved with this config.
For reference, here is the volume info, in case someone sees something I should change:

Volume Name: gluster
Type: Replicate
Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ipvr2.client_name:/mnt/storage/gluster
Brick2: ipvr3.client_name:/mnt/storage/gluster
Brick3: ipvr50.client_name:/mnt/storage/gluster
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 64MB
cluster.data-self-heal-algorithm: full
performance.readdir-ahead: on


The numbering starts at 2 and jumps to 50 because the first server is doing something else for now,
and I'm using 50 as the temporary third node. If everything goes well, I'll migrate the production
onto the cluster, re-install the first server and do a replace-brick, which I hope will work just as well
as the add-brick I'm doing now. The last replace-brick also brought everything down, but I guess that was the
joy of 3.7.6 :). A sketch of the planned swap is below.
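On 3.7, I believe only the one-shot form is supported, so the plan is roughly (NEW_NODE being the hypothetical reinstalled first server):

  gluster volume replace-brick gluster ipvr50.client_name:/mnt/storage/gluster NEW_NODE:/mnt/storage/gluster commit force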

Thanks!
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
Lindsay Mathieson
2016-04-25 12:50:53 UTC
Good luck!
Post by Kevin Lemonnier
Hi,
So I'm trying that now.
I installed 3.7.11 on two nodes and put a few VMs on it, same config
as before but with 64MB shards and the heal algo to full. As expected,
if I poweroff one of the nodes, everything is dead, which is fine.
Now I'm adding a third node, a big heal was started after the add-brick
of everything (7000+ shards), and for now everything seems to be working
fine on the VMs. Last time I tried adding a brick, all those VM died for
the duration of the heal, so that's already pretty good.
I'm gonna let it finish to copy everything on the new nodes, then I'll try
to simulate nodes going down to see if my original problem of freezing and
low heal time is solved with this config.
Volume Name: gluster
Type: Replicate
Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Brick1: ipvr2.client_name:/mnt/storage/gluster
Brick2: ipvr3.client_name:/mnt/storage/gluster
Brick3: ipvr50.client_name:/mnt/storage/gluster
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 64MB
cluster.data-self-heal-algorithm: full
performance.readdir-ahead: on
It starts at 2 and jumps to 50 because the first server is doing something else for now,
and I use 50 to be the temporary third node. If everything goes well, I'll migrate the production
on the cluster, re-install the first server and do a replace-brick, which I hope will work just as well
as the add-brick I'm doing now. Last replace-brick also brought everything down, but I guess that was the
joy of 3.7.6 :).
Thanks !
Post by Krutika Dhananjay
Post by Kevin Lemonnier
I will try migrating to 3.7.10, is it considered stable yet ?
Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)
Post by Kevin Lemonnier
Should I change the self heal algorithm even if I move to 3.7.10, or is
that not necessary ?
Not sure what that change might do.
So the other algorithm is 'diff' which computes rolling checksum on chunks
of the src(es) and sink(s), compares them and heals upon mismatch. This is
known to consume lot of CPU. 'full' algo on the other hand simply copies
the src into sink in chunks. With sharding, it shouldn't be all that bad
copying a 256MB file (in your case) from src to sink. We've used double the
block size and had no issues reported.
So you could change self heal algo to full even in the upgraded cluster.
-Krutika
Post by Kevin Lemonnier
Anyway, I'll try to create a 3.7.10 cluster in the week end slowly move
the VMs on it then,
Thanks a lot for your help,
Regards
Post by Krutika Dhananjay
Hi,
Yeah, so the fuse mount log didn't convey much information.
So one of the reasons heal may have taken so long (and also consumed
resources) is because of a bug in self-heal where it would do heal from
both source bricks in 3-way replication. With such a bug, heal would take
twice the amount of time and consume resources both the times by the same
amount.
This issue is fixed at http://review.gluster.org/#/c/14008/ and will be
available in 3.7.12.
The other thing you could do is to set cluster.data-self-heal-algorithm
to
Post by Krutika Dhananjay
'full', for better heal performance and more regulated resource
consumption
Post by Krutika Dhananjay
by the same.
#gluster volume set <VOL> cluster.data-self-heal-algorithm full
As far as sharding is concerned, some critical caching issues were fixed
in
Post by Krutika Dhananjay
3.7.7 and 3.7.8.
And my guess is that the vm crash/unbootable state could be because of
this
Post by Krutika Dhananjay
issue, which exists in 3.7.6.
3.7.10 saw the introduction of throttled client side heals which also
moves
Post by Krutika Dhananjay
such heals to the background, which is all the more helpful for
preventing
Post by Krutika Dhananjay
starvation of vms during client heal.
Considering these factors, I think it would be better if you upgraded
your
Post by Krutika Dhananjay
machines to 3.7.10.
Do let me know if migrating to 3.7.10 solves your issues.
-Krutika
Post by Kevin Lemonnier
Yes, but as I was saying I don't believe KVM is using a mount point, I
think it uses
the API (
http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt
Post by Krutika Dhananjay
Post by Kevin Lemonnier
).
Might be mistaken ofcourse. Proxmox does have a mountpoint for
conveniance, I'll attach those
logs, hoping they contain the informations you need. They do seem to
contain a lot of errors
for the 15.
For reference, there was a disconnect of the first brick (10.10.0.1) in
the morning and then a successfull
heal that caused about 40 minutes downtime of the VMs. Right after that
heal finished (if my memory is
correct it was about noon or close) the second brick (10.10.0.2)
rebooted,
Post by Krutika Dhananjay
Post by Kevin Lemonnier
and that's the one I disconnected
to prevent the heal from causing another downtime.
I reconnected it one at the end of the afternoon, hoping the heal
would go
Post by Krutika Dhananjay
Post by Kevin Lemonnier
well but everything went down
like in the morning so I disconnected it again, and waited 11pm
(23:00) to
Post by Krutika Dhananjay
Post by Kevin Lemonnier
reconnect it and let it finish.
Thanks for your help,
Post by Krutika Dhananjay
Sorry, I was referring to the glusterfs client logs.
Assuming you are using FUSE mount, your log file will be in
/var/log/glusterfs/<hyphenated-mount-point-path>.log
-Krutika
On Sun, Apr 17, 2016 at 9:37 PM, Kevin Lemonnier <
Post by Kevin Lemonnier
I believe Proxmox is just an interface to KVM that uses the lib,
so if
Post by Krutika Dhananjay
Post by Kevin Lemonnier
I'm
Post by Krutika Dhananjay
Post by Kevin Lemonnier
not mistaken there isn't client logs ?
It's not the first time I have the issue, it happens on every heal
on
Post by Krutika Dhananjay
Post by Kevin Lemonnier
the
Post by Krutika Dhananjay
Post by Kevin Lemonnier
2 clusters I have.
I did let the heal finish that night and the VMs are working now,
but
Post by Krutika Dhananjay
Post by Kevin Lemonnier
it
Post by Krutika Dhananjay
Post by Kevin Lemonnier
is pretty scarry for future crashes or brick replacement.
Should I maybe lower the shard size ? Won't solve the fact that 2
bricks
Post by Krutika Dhananjay
Post by Kevin Lemonnier
on 3 aren't keeping the filesystem usable but might make the
healing
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
quicker right ?
Thanks
Le 17 avril 2016 17:56:37 GMT+02:00, Krutika Dhananjay <
Post by Krutika Dhananjay
Could you share the client logs and information about the approx time/day
when you saw this issue?
-Krutika
On Sat, Apr 16, 2016 at 12:57 AM, Kevin Lemonnier
Post by Kevin Lemonnier
Hi,
We have a small glusterFS 3.7.6 cluster with 3 nodes running
with
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
proxmox
Post by Kevin Lemonnier
VM's on it. I did set up the different recommended option like
the
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
virt
Post by Kevin Lemonnier
group, but
by hand since it's on debian. The shards are 256MB, if that
matters.
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
This morning the second node crashed, and as it came back up
started
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
a
Post by Kevin Lemonnier
heal, but that basically froze all the VM's running on that
volume.
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Since
Post by Kevin Lemonnier
we really really
can't have 40 minutes down time in the middle of the day, I just
removed
Post by Kevin Lemonnier
the node from the network and that stopped the heal, allowing
the
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
VM's to
Post by Kevin Lemonnier
access
their disks again. The plan was to re-connecte the node in a
couple
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
of
Post by Kevin Lemonnier
hours to let it heal at night.
But a VM crashed now, and it can't boot up again : seems to
freez
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
trying
Post by Kevin Lemonnier
to access the disks.
Looking at the heal info for the volume, it has gone way up
since
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
this
Post by Kevin Lemonnier
morning, it looks like the VM's aren't writing to both nodes,
just
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
the one
Post by Kevin Lemonnier
they are on.
It seems pretty bad, we have 2 nodes on 3 up, I would expect the
volume to
Post by Kevin Lemonnier
work just fine since it has quorum. What am I missing ?
It is still too early to start the heal, is there a way to
start the
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
VM
Post by Kevin Lemonnier
anyway right now ? I mean, it was running a moment ago so the
data
Post by Krutika Dhananjay
Post by Kevin Lemonnier
is
Post by Krutika Dhananjay
Post by Kevin Lemonnier
Post by Krutika Dhananjay
Post by Kevin Lemonnier
there, it just needs
to let the VM access it.
Volume Name: vm-storage
Type: Replicate
Volume ID: a5b19324-f032-4136-aaac-5e9a4c88aaef
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Brick1: first_node:/mnt/vg1-storage
Brick2: second_node:/mnt/vg1-storage
Brick3: third_node:/mnt/vg1-storage
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 256MB
cluster.server-quorum-ratio: 51%
Thanks for your help
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
_______________________________________________
Gluster-users mailing list
http://www.gluster.org/mailman/listinfo/gluster-users
--
Envoyé de mon appareil Android avec K-9 Mail. Veuillez excuser ma
brièveté.
Post by Krutika Dhananjay
Post by Kevin Lemonnier
_______________________________________________
Gluster-users mailing list
http://www.gluster.org/mailman/listinfo/gluster-users
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
_______________________________________________
Gluster-users mailing list
http://www.gluster.org/mailman/listinfo/gluster-users
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
_______________________________________________
Gluster-users mailing list
http://www.gluster.org/mailman/listinfo/gluster-users
_______________________________________________
Gluster-users mailing list
http://www.gluster.org/mailman/listinfo/gluster-users
--
Lindsay Mathieson
Kevin Lemonnier
2016-05-02 09:05:10 UTC
Permalink
Hi,

So after some testing, it is a lot better, but I do still have some problems with 3.7.11.
When I reboot a server it seems to show some strange behaviour sometimes, but I need to test
that better.
Removing a server from the network, waiting for a while, then adding it back and letting it
heal works perfectly: completely invisible for the users, and that's perfect !

However, when I add a brick, changing the replica count from 2 to 3, it starts a heal
and some VMs switch to read-only. I have to power them off then on again to fix it;
clearly it's better than with 3.7.6, which froze the VMs until the heal was complete,
but I would still like to understand why some of the VMs are switching to read-only.
It looks like it happens every time I add a brick to increase the replica count. I would like
to test adding a whole replica set at once, but I just don't have the hardware for that.

Rebooting a node looks like it's making some VMs go read-only too, but I need to test
that better. For some reason rebooting a brick or adding a brick causes I/O errors
on some VM disks and not others, and I have to power those VMs off and then on to fix it.
I can't just reboot them; I guess I have to actually re-open the file to trigger a heal ?
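In case it's useful, here is how I keep an eye on the heal without hammering the volume (a sketch, assuming the 3.7 CLI):

# per-brick count of entries still pending heal:
gluster volume heal gluster statistics heal-count
# manually kick off a full heal if it doesn't start on its own:
gluster volume heal gluster full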

Any idea on how to prevent those read-only switches ? It's a lot better than 3.7.6, 'cause it
can be fixed in a minute, but that's still not great to have to explain to the clients.

Thanks
[...]
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
Krutika Dhananjay
2016-05-02 09:19:35 UTC
Permalink
Could you attach the glusterfs client, shd logs?

-Krutika
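A sketch of grabbing them, assuming the default log locations on each node:

# the self-heal daemon log lives here by default:
ls -l /var/log/glusterfs/glustershd.log
# bundle it up to attach:
tar czf shd-logs.tar.gz /var/log/glusterfs/glustershd.log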
[...]
Kevin Lemonnier
2016-05-02 09:57:19 UTC
Permalink
Like last time, I'm using Proxmox, so I don't have a client log; it's using the lib.
I've attached the shd log from the first node; do you need the other two as well ?
I tar.gz'ed it, hope that's okay. In case it's not clear from the logs, I removed the brick
on ipvr50 then added it again (after rm -Rf /mnt/storage/gluster on it, of course).
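The exact sequence was, from memory (assuming I have the 3.7 syntax right):

# drop back to replica 2, removing the ipvr50 brick:
gluster volume remove-brick gluster replica 2 ipvr50.client_name:/mnt/storage/gluster force
# wipe the old brick contents so it comes back clean:
rm -Rf /mnt/storage/gluster
# re-add it as replica 3 again, which is what kicks off the big heal:
gluster volume add-brick gluster replica 3 ipvr50.client_name:/mnt/storage/gluster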

thanks
[...]
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
Kevin Lemonnier
2016-05-02 11:38:58 UTC
Permalink
After some more testing, it looks like rebooting a server is fine:
everything continues to work during the reboot and then during the heal,
exactly like when I simulate a network outage.
I guess my earlier problems were just leftovers from adding a brick; it looks like
that causes real problems, and sometimes VMs switch to read-only a while after the add.

From the add-brick I did this morning, only two VMs seem to have "resisted" it
without having to be rebooted, out of 8 VMs total. About half of them had to be rebooted
as soon as the heal started, and two of them didn't switch to read-only
but did complain about errors in the console, so I just rebooted them to be sure.
They were fine during the heal and started complaining after the heal finished,
but I'm guessing that's just because they weren't accessing their disks much.
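For what it's worth, this is what I check inside a guest to see if it was hit (plain shell, nothing gluster-specific):

# filesystems that got remounted read-only after the I/O errors:
awk '$4 ~ /^ro(,|$)/ {print $2, $1}' /proc/mounts
# and the kernel messages from the errors themselves:
dmesg | grep -iE 'i/o error|read-only'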
Post by Kevin Lemonnier
Like last time, I'm using Proxmox so I don't have a client log, it's using the lib.
I attach the shd log from the first node, do you need the other two maybe ?
I tar.gz'ed it, hope that's okay. In case it's not clear from the logs, I removed the brick
on ipvr50 then added it again (after rm -Rf /mnt/storage/gluster on it, of course).
thanks
Post by Krutika Dhananjay
Could you attach the glusterfs client, shd logs?
-Krutika
Post by Kevin Lemonnier
Hi,
So after some testing, it is a lot better, but I do still have some problems with 3.7.11.
When I reboot a server it seems to show some strange behaviour sometimes, but I need to test that better.
Removing a server from the network, waiting for a while, then adding it back and letting it heal works perfectly: completely invisible for the user, and that's perfect!
However, when I add a brick, changing the replica count from 2 to 3, it starts a heal and some VMs switch to read-only. I have to power them off and on again to fix it. Clearly that's better than 3.7.6, which froze the VMs until the heal was complete, but I would still like to understand why some of the VMs are switching to read-only. It looks like it happens every time I add a brick to increase the replica count; I would like to test adding a whole replica set at once, but I just don't have the hardware for that.
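(For reference, the add-brick I run is along these lines, with the same volume name and brick host/path as in my earlier mail quoted below; a sketch rather than a paste of my exact shell:)

   gluster volume add-brick gluster replica 3 ipvr50.client_name:/mnt/storage/gluster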
Rebooting a node looks like it makes some VMs go read-only too, but I need to test that better. For some reason, rebooting a brick or adding a brick causes I/O errors on some VM disks and not others, and I have to power those VMs off and then on to fix it. I can't just reboot them; I guess I have to actually re-open the file to trigger a heal?
Any idea how to prevent that? It's a lot better than 3.7.6 because it can be fixed in a minute, but that's still not great to explain to the clients.
Thanks
Post by Kevin Lemonnier
Hi,
So I'm trying that now.
I installed 3.7.11 on two nodes and put a few VMs on it, same config as before but with 64MB shards and the heal algo set to full. As expected, if I power off one of the nodes, everything is dead, which is fine.
Now I'm adding a third node. A big heal of everything (7000+ shards) was started after the add-brick, and for now everything seems to be working fine on the VMs. Last time I tried adding a brick, all those VMs died for the duration of the heal, so that's already pretty good.
I'm gonna let it finish copying everything to the new node, then I'll try to simulate nodes going down to see if my original problem of freezing and slow heal time is solved with this config.
For reference, here is the volume info, in case someone sees something I missed:
Volume Name: gluster
Type: Replicate
Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ipvr2.client_name:/mnt/storage/gluster
Brick2: ipvr3.client_name:/mnt/storage/gluster
Brick3: ipvr50.client_name:/mnt/storage/gluster
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 64MB
cluster.data-self-heal-algorithm: full
performance.readdir-ahead: on
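(If it helps to reproduce, the non-default options above come from "volume set" commands along these lines; a sketch, and the virt group would apply much the same settings:)

   gluster volume set gluster features.shard on
   gluster volume set gluster features.shard-block-size 64MB
   gluster volume set gluster cluster.data-self-heal-algorithm full
   gluster volume set gluster network.remote-dio enable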
It starts at 2 and jumps to 50 because the first server is doing something else for now, and I use 50 as the temporary third node. If everything goes well, I'll migrate the production onto the cluster, re-install the first server and do a replace-brick, which I hope will work just as well as the add-brick I'm doing now. The last replace-brick also brought everything down, but I guess that was the joy of 3.7.6 :).
Thanks!
Post by Krutika Dhananjay
Post by Kevin Lemonnier
I will try migrating to 3.7.10, is it considered stable yet?
Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)
Post by Kevin Lemonnier
Should I change the self heal algorithm even if I move to 3.7.10, or is that not necessary? Not sure what that change might do.
So the other algorithm is 'diff', which computes rolling checksums on chunks of the src(es) and sink(s), compares them, and heals upon mismatch. This is known to consume a lot of CPU. The 'full' algo, on the other hand, simply copies the src into the sink in chunks. With sharding, it shouldn't be all that bad copying a 256MB file (in your case) from src to sink. We've used double the block size and had no issues reported.
So you could change the self heal algo to full even in the upgraded cluster.
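To apply and verify, something like this should do (a sketch; "volume get" is assumed to be present on your 3.7 build, otherwise "gluster volume info" lists the reconfigured options):

   gluster volume set <VOL> cluster.data-self-heal-algorithm full
   gluster volume get <VOL> cluster.data-self-heal-algorithm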
-Krutika
Post by Kevin Lemonnier
Anyway, I'll try to create a 3.7.10 cluster over the week-end and slowly move the VMs onto it then.
Thanks a lot for your help,
Regards
Post by Krutika Dhananjay
Hi,
Yeah, so the fuse mount log didn't convey much information.
So one of the reasons the heal may have taken so long (and also consumed resources) is a bug in self-heal where it would heal from both source bricks in 3-way replication. With such a bug, a heal would take twice the amount of time and consume twice the resources.
This issue is fixed at http://review.gluster.org/#/c/14008/ and will be available in 3.7.12.
The other thing you could do is to set cluster.data-self-heal-algorithm to 'full', for better heal performance and more regulated resource consumption:
# gluster volume set <VOL> cluster.data-self-heal-algorithm full
As far as sharding is concerned, some critical caching issues were fixed in 3.7.7 and 3.7.8, and my guess is that the vm crash/unbootable state could be because of this issue, which exists in 3.7.6.
3.7.10 saw the introduction of throttled client-side heals, which also moves such heals to the background; that is all the more helpful for preventing starvation of vms during client heals.
Considering these factors, I think it would be better if you upgraded your machines to 3.7.10.
Do let me know if migrating to 3.7.10 solves your issues.
-Krutika
On Mon, Apr 18, 2016 at 12:40 PM, Kevin Lemonnier wrote:
Yes, but as I was saying, I don't believe KVM is using a mount point; I think it uses the API (http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt). Might be mistaken of course. Proxmox does have a mount point for convenience, so I'll attach those logs, hoping they contain the information you need. They do seem to contain a lot of errors for the 15th.
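(For illustration, with libgfapi qemu opens the disk through a gluster:// URL directly instead of a mounted path, something along these lines; the image path here is made up for the example, while the host and volume are ours:)

   qemu-system-x86_64 ... -drive file=gluster://10.10.0.1/vm-storage/images/vm-100-disk-1.qcow2,if=virtio,cache=none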
For reference, there was a disconnect of the first brick (10.10.0.1) in the morning, and then a successful heal that caused about 40 minutes of downtime for the VMs. Right after that heal finished (if my memory is correct it was about noon or close), the second brick (10.10.0.2) rebooted, and that's the one I disconnected to prevent the heal from causing another downtime.
I reconnected it once at the end of the afternoon, hoping the heal would go well, but everything went down like in the morning, so I disconnected it again and waited until 11pm (23:00) to reconnect it and let it finish.
Thanks for your help,
On Mon, Apr 18, 2016 at 12:28:28PM +0530, Krutika Dhananjay wrote:
Sorry, I was referring to the glusterfs client logs.
Assuming you are using a FUSE mount, your log file will be in
/var/log/glusterfs/<hyphenated-mount-point-path>.log
-Krutika
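(For example, with a hypothetical mount point /mnt/pve/vmstore, the client log would be /var/log/glusterfs/mnt-pve-vmstore.log.)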
On Sun, Apr 17, 2016 at 9:37 PM, Kevin Lemonnier wrote:
I believe Proxmox is just an interface to KVM that uses the lib, so if I'm not mistaken there aren't any client logs?
It's not the first time I've had the issue; it happens on every heal on the 2 clusters I have.
I did let the heal finish that night and the VMs are working now, but it is pretty scary for future crashes or brick replacements.
Should I maybe lower the shard size? It won't solve the fact that 2 bricks out of 3 aren't keeping the filesystem usable, but it might make the healing quicker, right?
Thanks
On 17 April 2016 at 17:56:37 GMT+02:00, Krutika Dhananjay wrote:
[...]
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
_______________________________________________
Gluster-users mailing list
http://www.gluster.org/mailman/listinfo/gluster-users