Karli Sjöberg
2018-08-10 12:08:57 UTC
Hey all!
I am playing around on my computer with setting up a virtual
mini-cluster of five VMs:
1x router
1x client
3x Gluster/NFS-Ganesha servers
The router is pfSense, the client is Xubuntu 18.04 and the servers are
CentOS 7.5.
I set up the cluster using 'gdeploy' with configuration snippets taken
from oVirt/Cockpit HCI setup and another snippet for setting up the
NFS-Ganesha part of it. The configuration is successful apart from some
minor details I debugged but I'm fairly sure I haven't made any obvious
misses.
All of the VMs are registered in pfSense's DNS, as well as the VIPs
for the NFS-Ganesha nodes, which works great; the client has no
issues resolving any of the names (a quick getent check is shown
right after the list):
hv01.localdomain 192.168.1.101
hv02.localdomain 192.168.1.102
hv03.localdomain 192.168.1.103
hv01v.localdomain 192.168.1.110
hv02v.localdomain 192.168.1.111
hv03v.localdomain 192.168.1.112
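Just to show that resolution really works, a getent loop over the
names above (run from the client) comes back with the expected
addresses:
client# for h in hv01 hv02 hv03 hv01v hv02v hv03v; do getent hosts $h.localdomain; done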
The cluster status is HEALTHY according to
'/usr/libexec/ganesha/ganesha-ha.sh' before I start my tests.
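For reference, the status check is roughly this (the shared-storage
path is the one the stock Gluster/gdeploy setup uses, so yours may
differ):
hv01# /usr/libexec/ganesha/ganesha-ha.sh --status \
    /var/run/gluster/shared_storage/nfs-ganesha
The test itself, run from the client: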
client# mount -t nfs -o vers=4.1 hv01v.localdomain:/data /mnt
client# dd if=/dev/urandom of=/var/tmp/test.bin bs=1M count=1024
client# while true; do rsync /var/tmp/test.bin /mnt/; rm -f /mnt/test.bin; done
Then, after a while, the 'nfs-ganesha' service unexpectedly dies and
doesn't restart by itself. The copy loop resumes after a while once
'hv02' takes over, then history repeats itself there, and so on until
all of the nodes' 'nfs-ganesha' services are dead.
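When a node has died, I check it with plain systemd tooling plus the
Pacemaker status, more or less like this:
hv01# systemctl status nfs-ganesha -l
hv01# journalctl -u nfs-ganesha --since "1 hour ago"
hv01# pcs status    # shows the VIP having moved to another node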
With normal logging enabled, the dead node says nothing before dying
(sudden heart attack syndrome), so no clues there, and the remaining
ones only say that they've taken over...
Right now I'm running with FULL_DEBUG, which makes testing very
difficult since the throughput is down to a crawl. Nothing strange
about that, it just takes a lot more time to provoke the crash.
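For completeness, FULL_DEBUG is turned on with a LOG block in
/etc/ganesha/ganesha.conf along these lines (I set ALL rather than
picking individual components, which is probably why it crawls):
LOG {
    COMPONENTS {
        # everything at the highest log level
        ALL = FULL_DEBUG;
    }
}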
Please don't hesitate to ask for more information in case there's
something else you'd like me to share!
I'm hoping someone recognizes this behaviour and knows what I'm doing
wrong :)
glusterfs-client-xlators-3.10.12-1.el7.x86_64
glusterfs-api-3.10.12-1.el7.x86_64
nfs-ganesha-2.4.5-1.el7.x86_64
centos-release-gluster310-1.0-1.el7.centos.noarch
glusterfs-3.10.12-1.el7.x86_64
glusterfs-cli-3.10.12-1.el7.x86_64
nfs-ganesha-gluster-2.4.5-1.el7.x86_64
glusterfs-server-3.10.12-1.el7.x86_64
glusterfs-libs-3.10.12-1.el7.x86_64
glusterfs-fuse-3.10.12-1.el7.x86_64
glusterfs-ganesha-3.10.12-1.el7.x86_64
Thanks in advance!
/K