Discussion:
VM fs becomes read only when one gluster node goes down
André Bauer
2015-10-22 18:45:04 UTC
Hi,

I have a 4-node GlusterFS 3.5.6 cluster.

My VM images are in a distributed replicated volume which is accessed
from KVM/QEMU via libgfapi.

The mount is against storage.domain.local, which has the IPs of all 4 Gluster
nodes set in DNS.

When one of the Gluster nodes goes down (accidental reboot), a lot of the
VMs get a read-only filesystem, even after the node comes back up.

How can I prevent this?
I expect the VM to simply use the replica on the other node,
without its filesystem going read-only.

Any hints?

Thanks in advance.
--
Regards
André Bauer
Krutika Dhananjay
2015-10-23 02:24:29 UTC
Could you share the output of 'gluster volume info', and also information as to which node went down on reboot?

-Krutika
Roman
2015-10-26 13:33:57 UTC
Hi,

Do you have backupvolfile-server=NODE2NAMEHERE in fstab? :)
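
For a FUSE mount that would be something along these lines (volume name and
mount point are just examples):

  storage1.domain.local:/vmimages  /mnt/vmimages  glusterfs  defaults,_netdev,backupvolfile-server=storage2.domain.local  0  0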
--
Best regards,
Roman.
Josh Boon
2015-10-26 16:41:22 UTC
Andre,

I've not explored using a DNS solution to publish the gluster cluster addressing space, but two things you'll want to check are network.ping-timeout and whether or not your VM goes read-only on filesystem errors. If your network is consistent and robust, tuning network.ping-timeout to a very low value such as three seconds will make the client drop the unresponsive server on failure. The default value is 42 seconds, which will cause your VM to go read-only as you've seen. You could also have your VMs mount their partitions with errors=continue, depending on the filesystem they run.

Our setup has the timeout at seven seconds and errors=continue, and it has survived both testing and storage-node segfaults. No data integrity issues have shown up yet, but our data is mostly temporal, so integrity hasn't been tested thoroughly. Also, we're on qemu 2.0 running gluster 3.6 on Ubuntu 14.04, for those curious.
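
Roughly, those two knobs look like this (volume name and values here are only
examples, adjust for your setup):

  gluster volume set <volname> network.ping-timeout 7

and inside the VM an fstab entry along the lines of:

  /dev/vda1  /  ext4  defaults,errors=continue  0  1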

Best,
Josh


André Bauer
2015-10-26 19:08:15 UTC
Thanks guys!
My volume info is at the bottom of this mail...

@ Josh
As you can see, I already have a 5-second ping timeout set. I will try
it with 3 seconds.

Not sure if I want errors=continue at the filesystem level, but I will
give it a try if it's the only way to get automatic failover working.


@ Roman
I use qemu with libgfapi to access the images, so there are no glusterfs
entries in fstab on my VM hosts. It also seems this option is kind of deprecated:

http://blog.gluster.org/category/mount-glusterfs/

"`backupvolfile-server` - This option did not really do much rather than
provide a 'shell' script based failover which was highly racy and
wouldn't work during many occasions. It was necessary to remove this to
make room for better options (while it is still provided for backward
compatibility in the code)"


@ all
Can anybody tell me how GlusterFS handles this internally?
Is the libgfapi client already aware of the server which holds the replica
of the image?
Is there a way I can configure it manually for a volume?




Volume Name: vmimages
Type: Distributed-Replicate
Volume ID: 029285b2-dfad-4569-8060-3827c0f1d856
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: storage1.domain.local:/glusterfs/vmimages
Brick2: storage2.domain.local:/glusterfs/vmimages
Brick3: storage3.domain.local:/glusterfs/vmimages
Brick4: storage4.domain.local:/glusterfs/vmimages
Options Reconfigured:
network.ping-timeout: 5
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
auth.allow: 192.168.0.21,192.168.0.22,192.168.0.23,192.168.0.24,192.168.0.25,192.168.0.26
server.allow-insecure: on
storage.owner-uid: 2000
storage.owner-gid: 2000



Regards
André
Josh Boon
2015-10-26 19:23:03 UTC
Hmm, even five should be OK. Do you lose all VMs or just some?

Also, we had issues with

cluster.quorum-type: auto
cluster.server-quorum-type: server

and had to instead go with

cluster.server-quorum-type: none
cluster.quorum-type: none

though we only replicate rather than distribute and replicate, so I'd be wary of changing those without advice from folks more familiar with the impact on your config.

gfapi fetches the volume file upon connect and is aware of the configuration and any changes to it, so it should be OK when a node is lost, since it knows where the other nodes are.
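
For example, a disk attached via libgfapi only needs one host to fetch the
volfile from; the qemu drive spec looks something like this (image name is
just an example):

  -drive file=gluster://storage.domain.local/vmimages/vm1.qcow2,if=virtio,cache=none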

If you have a lab with your gluster config set up and you lose all of your VMs, I'd suggest trying my config to see what happens. The gluster logs and qemu clients could also give some hints about what happens when a node disappears.
André Bauer
2015-10-26 19:47:07 UTC
Just some. But I think the reason is that some VM images are replicated on
nodes 1 & 2 and some on nodes 3 & 4, because I use a distributed/replicated
volume.

You're right. I think I have to try it on a test setup.

At the moment I'm also not completely sure whether it's a GlusterFS problem
(not connecting to the node with the replica immediately when a
read/write fails) or a problem of the filesystem (ext4 going read-only
on errors too early).


Regards
André
Josh Boon
2015-10-26 20:06:55 UTC
I'd see what your qemu logs put out, if you have them around from a crash. You can also check client connections across your cluster by hopping onto your hypervisor and grepping the output of netstat -np for the PID of one of your gluster-backed VMs, like so:

netstat -np | grep 11607
tcp 0 0 10.9.1.1:60414 10.9.1.1:24007 ESTABLISHED 11607/qemu-system-x
tcp 0 0 10.9.1.1:60409 10.9.1.1:24007 ESTABLISHED 11607/qemu-system-x
tcp 0 0 10.9.1.1:45998 10.9.1.1:50152 ESTABLISHED 11607/qemu-system-x
tcp 0 0 10.9.1.1:42606 10.9.1.2:50152 ESTABLISHED 11607/qemu-system-x
tcp 0 0 10.9.1.1:45993 10.9.1.1:50152 ESTABLISHED 11607/qemu-system-x
tcp 0 0 10.9.1.1:42601 10.9.1.2:50152 ESTABLISHED 11607/qemu-system-x
unix 3 [ ] STREAM CONNECTED 32860 11607/qemu-system-x /var/lib/libvirt/qemu/HFMWEB19.monitor

I mounted two disks for the machine, so I have two management connections and two brick connections per disk for my replicated setup. Someone else might be able to provide more info on what your output should look like.

Niels de Vos
2015-10-26 20:56:24 UTC
There are at least two timeouts that are involved in this problem:

1. The filesystem in a VM can go read-only when the virtual disk where
the filesystem is located does not respond for a while.

2. When a storage server that holds a replica of the virtual disk
becomes unreachable, the Gluster client (qemu+libgfapi) waits for
max. network.ping-timeout seconds before it resumes I/O.

Once a filesystem in a VM goes read-only, you might be able to fsck and
re-mount it read-writable again. It is not something a VM will do by
itself.
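
A manual recovery inside the VM would be roughly (device and mount point are
only examples):

  # umount /data               # if the filesystem can be unmounted
  # fsck.ext4 -f /dev/vdb1
  # mount /dev/vdb1 /data

For the root filesystem, a reboot with a forced fsck is usually the simpler
option.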


The timeouts for (1) are set in sysfs:

$ cat /sys/block/sda/device/timeout
30

30 seconds is the default for SD-devices, and for testing you can change
it with an echo:

# echo 300 > /sys/block/sda/device/timeout

This is not a persistent change; you can create a udev rule to apply it
at boot.
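
Such a rule could look roughly like this (file name and value are just
examples):

  # /etc/udev/rules.d/99-disk-timeout.rules
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/bin/sh -c 'echo 300 > /sys/block/%k/device/timeout'"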

Some filesystems offer a mount option that changes the behaviour after a
disk error is detected. "man mount" shows the "errors" option for ext*.
Changing this to "continue" is not recommended; "abort" or "panic" will be
the safest for your data.
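
The ext* default can also be set in the superblock instead of fstab (device
is just an example):

  # tune2fs -e panic /dev/vda1
  # tune2fs -l /dev/vda1 | grep -i 'errors behavior'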


The timeout mentioned in (2) is for the Gluster volume, and is checked by
the client. When a client writes to a replicated volume, the write needs
to be acknowledged by both/all replicas. The client (libgfapi) delays the
reply to the application (qemu) until the replies from both/all replicas
have been received. The maximum wait is configured with the volume option
network.ping-timeout (42 seconds by default).


Now, if the VM returns block errors after 30 seconds, and the client
waits up to 42 seconds for recovery, there is an issue... So, your
solution could be to increase the timeout for error detection of the
disks inside the VMs, and/or decrease the network.ping-timeout.

It would be interesting to know if adapting these values prevents the
read-only occurrences in your environment. If you do any testing with
this, please keep me informed about the results.

Niels
Roman
2015-10-26 23:56:31 UTC
Aren't we talking about this patch?
https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/gluster-backupserver.patch;h=ad241ee1154ebbd536d7c2c7987d86a02255aba2;hb=HEAD
--
Best regards,
Roman.
Niels de Vos
2015-10-27 09:13:29 UTC
No, a backup volfile-server option is only effective while doing the initial
mount. In case the 1st storage server is not available to retrieve the
volume layout (.vol file), other servers can be used as a fallback. Once
the volume layout is known to the Gluster client, the client talks to all
the bricks directly.

Also qemu+libgfapi does a "mount" of the volume, before it can open the
disk image. This "mount" is a library call, not the usual syscall, and
only fetches the volume layout from a GlusterD service.
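
You can see what such a client receives by looking at the generated client
volfile on one of the servers (the exact path/name may differ per version):

  # cat /var/lib/glusterd/vols/vmimages/vmimages-fuse.vol

It contains protocol/client sections for all bricks, which is why the client
can keep talking to the remaining replicas once it has the layout.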

HTH,
Niels
André Bauer
2015-10-27 18:21:35 UTC
Hi Niels,

My network.ping-timeout was already set to 5 seconds.

Unfortunately it seems I don't have the timeout setting in Ubuntu 14.04
for my vda disk.

ls -al /sys/block/vda/device/ gives me only:

drwxr-xr-x 4 root root    0 Oct 26 20:21 ./
drwxr-xr-x 5 root root    0 Oct 26 20:21 ../
drwxr-xr-x 3 root root    0 Oct 26 20:21 block/
-r--r--r-- 1 root root 4096 Oct 27 18:13 device
lrwxrwxrwx 1 root root    0 Oct 27 18:13 driver -> ../../../../bus/virtio/drivers/virtio_blk/
-r--r--r-- 1 root root 4096 Oct 27 18:13 features
-r--r--r-- 1 root root 4096 Oct 27 18:13 modalias
drwxr-xr-x 2 root root    0 Oct 27 18:13 power/
-r--r--r-- 1 root root 4096 Oct 27 18:13 status
lrwxrwxrwx 1 root root    0 Oct 26 20:21 subsystem -> ../../../../bus/virtio/
-rw-r--r-- 1 root root 4096 Oct 26 20:21 uevent
-r--r--r-- 1 root root 4096 Oct 26 20:21 vendor


Is the quorum setting a problem if you only have 2 replicas?

My volume has these quorum options set:

cluster.quorum-type: auto
cluster.server-quorum-type: server

As I understand the documentation
(https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.0/html/Administration_Guide/sect-User_Guide-Managing_Volumes-Quorum.html),
cluster.server-quorum-ratio is "> 50%" by default, which can never be met
if you only have 2 replicas and one node goes down, right?

Do I need cluster.server-quorum-ratio = 50% in this case?



@ Josh

Qemu had this in its log at the time the VM got a read-only fs:

[2015-10-22 17:44:42.699990] E [socket.c:2244:socket_connect_finish]
0-vmimages-client-2: connection to 192.168.0.43:24007 failed
(Connection refused)
[2015-10-22 17:45:03.411721] E
[client-handshake.c:1760:client_query_portmap_cbk]
0-vmimages-client-2: failed to get the port number for remote
subvolume. Please run 'gluster volume status' on server to see if
brick process is running.

netstat looks good. As expected, I currently have connections to all 4
GlusterFS nodes.



@ Eivind
I don't think I had a split-brain.
Only the VM got a read-only filesystem, not the file on the GlusterFS node.



Regards
André
Niels de Vos
2015-10-28 09:25:09 UTC
Replica 2 for VM storage is troublesome. Sahine just responded very
nicely to a very similar email:

http://thread.gmane.org/gmane.comp.file-systems.gluster.user/22818/focus=22823

HTH,
Niels
Diego Remolina
2015-10-28 13:38:58 UTC
I am running Ovirt and self-hosted engine with additional vms on a
replica two gluster volume. I have an "arbiter" node and set quorum
ratio to 51%. The arbiter node is just another machine with the
glusterfs bits installed that is part of the gluster peers but has no
bricks to it.
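
The relevant bits of that setup are roughly (hostname is just an example):

  gluster peer probe arbiter.domain.local
  gluster volume set all cluster.server-quorum-ratio 51%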

You will have to be very careful where you put these three machines if
they are going to go in separate server rooms or buildings. There are
pros and cons to distribution of the nodes and network topology may
also influence that.

In my case this is on a campus: I have machines in 3 separate buildings,
and all machines are on the same main campus router (we have more than one
main router). All machines are connected via 10 Gbps. If I had one node
with bricks and the arbiter in the same building and that building went
down (power/AC/chilled water/network), then the other node with bricks
would be useless. This is why I have machines in 3 different buildings.
Oh, and this is because most of the client systems are not even in the
same building as the servers. If my client machines and servers were in
the same building, then putting one node with bricks and the arbiter in
that same building could make sense.

HTH,

Diego
André Bauer
2015-11-02 17:54:34 UTC
Thanks for the hints guys :-)

I think I will try to use an arbiter. As I use distributed/replicated
volumes, I think I have to add 2 arbiters, right?

My nodes have 10 Gbit interfaces. Would 1 Gbit be enough for the arbiter(s)?

Regards
André
Steve Dainard
2015-11-02 22:48:49 UTC
I wouldn't think you'd need any 'arbiter' nodes (in quotes because in 3.7+
there is an actual arbiter node at the volume level). You have 4 nodes, and
if you lose 1, you're at 3/4 or 75%.

Personally I've not had much luck with 2 nodes (with or without the fake
arbiter node) as storage for oVirt VMs. I ran into a slew of storage domain
failure issues (no data loss), hanging VMs, etc. Instead I went with a
replica 3 volume just for VM storage (3 x 1TB SSDs), and bulk storage is
distributed replica 2.

I found that when a node in a replica pair goes down and is timing out,
there is zero IO (no read, no write). After a timeout I end up with a
read-only filesystem for whatever data was stored on that replica pair. Not
very useful for something stateful like a VM. The only way to get write
access back was to get the failed node up and running, and usually the VMs
in oVirt ended up in a 'paused' state that couldn't be recovered from.

I also tested the volume-level arbiter (replica 2 arbiter 1) with gluster
3.7.3 before going to 3.6.6 and replica 3, and found IO was too slow for my
environment. The bug report I filed is here, for some write-speed
references: https://bugzilla.redhat.com/show_bug.cgi?id=1255110
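
For reference, creating such an arbiter volume on 3.7+ looks roughly like
this (hostnames and brick paths are just examples; every third brick becomes
the arbiter):

  gluster volume create vmimages replica 3 arbiter 1 \
    server1:/bricks/vmimages server2:/bricks/vmimages arbiter1:/bricks/vmimages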

In any case I'd stick with a stable release of Gluster, and try to get
replica 3 for VM storage if you can.