Discussion:
[Gluster-users] One node goes offline, the other node can't see the replicated volume anymore
Greg Scott
2013-07-09 01:17:08 UTC
Permalink
I don't get this. I have a replicated volume and 2 nodes. My challenge is, when I take one node offline, the other node can no longer access the volume until both nodes are back online again.

Details:

I have 2 nodes, fw1 and fw2. Each node has an XFS file system, /gluster-fw1 on node fw1 and /gluster-fw2 on node fw2. Node fw1 is at IP Address 192.168.253.1. Node fw2 is at 192.168.253.2.

I create a gluster volume named firewall-scripts which is a replica of those two XFS file systems. The volume holds a bunch of config files common to both fw1 and fw2. The application is an active/standby pair of firewalls and the idea is to keep config files in a gluster volume.

When both nodes are online, everything works as expected. But when I take either node offline, node fw2 behaves badly:

[***@chicago-fw2 ~]# ls /firewall-scripts
ls: cannot access /firewall-scripts: Transport endpoint is not connected

And when I bring the offline node back online, node fw2 eventually behaves normally again.

What's up with that? Gluster is supposed to be resilient and self-healing and able to stand up to this sort of abuse. So I must be doing something wrong.

Here is how I set up everything - it doesn't get much simpler than this, and my setup is right out of the Getting Started Guide, just using my own names.

Here are the steps I followed, all from fw1:

gluster peer probe 192.168.253.2
gluster peer status

Create and start the volume:

gluster volume create firewall-scripts replica 2 transport tcp 192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
gluster volume start firewall-scripts

On fw1:

mkdir /firewall-scripts
mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts

and add this line to /etc/fstab:
192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

on fw2:

mkdir /firewall-scripts
mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts

and add this line to /etc/fstab:
192.168.253.2:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

That's it. That's the whole setup. When both nodes are online, everything replicates beautifully. But take one node offline and it all falls apart.
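For completeness, here is roughly how I sanity-check things after mounting - the commands are what I understand to be available in 3.4, and the grep pattern is just what I look for in the client log:

gluster volume status firewall-scripts                                        # both bricks should show as online
grep -E "Connected to|disconnected" /var/log/glusterfs/firewall-scripts.log   # the mount should have connected to both bricks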

Here is the output from gluster volume info, identical on both nodes:

[***@chicago-fw1 etc]# gluster volume info

Volume Name: firewall-scripts
Type: Replicate
Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2
[***@chicago-fw1 etc]#

Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see errors like this every couple of seconds:

[2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 0-firewall-scripts-replicate-0: no subvolumes up
[2013-07-09 00:59:04.706515] W [fuse-bridge.c:1132:fuse_err_cbk] 0-glusterfs-fuse: 3160: FLUSH() ERR => -1 (Transport endpoint is not connected)

And then when I bring fw1 back online, I see these messages on fw2:

[2013-07-09 01:01:35.006782] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-0: changing port to 49152 (from 0)
[2013-07-09 01:01:35.006932] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-09 01:01:35.018546] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2013-07-09 01:01:35.019273] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-0: Connected to 192.168.253.1:49152, attached to remote volume '/gluster-fw1'.
[2013-07-09 01:01:35.019356] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-09 01:01:35.019441] I [client-handshake.c:1308:client_post_handshake] 0-firewall-scripts-client-0: 1 fds open - Delaying child_up until they are re-opened
[2013-07-09 01:01:35.020070] I [client-handshake.c:930:client_child_up_reopen_done] 0-firewall-scripts-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2013-07-09 01:01:35.020282] I [afr-common.c:3698:afr_notify] 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' came back up; going online.
[2013-07-09 01:01:35.020616] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-0: Server lk version = 1

So how do I make glusterfs survive a node failure, which is the whole point of all this?

thanks

- Greg Scott
Greg Scott
2013-07-09 17:36:59 UTC
Permalink
No takers? I am running gluster 3.4beta3 that came with Fedora 19. Is my issue a consequence of some kind of quorum split-brain thing?
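If it is quorum-related, these are the options I have been reading about but have not touched - option names as I understand them for 3.4, listed only so someone can tell me whether they matter here:

gluster volume set firewall-scripts cluster.quorum-type auto          # client-side quorum
gluster volume set firewall-scripts cluster.server-quorum-type server # server-side quorum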

thanks


- Greg Scott

Greg Scott
2013-07-10 05:26:30 UTC
Permalink
Bummer. Looks like I'm on my own with this one.


- Greg

Pranith Kumar Karampuri
2013-07-10 05:57:32 UTC
Permalink
hi Greg,
Could you let us know what logs appear in fw1's mount log when fw2 is taken down? It would be nice if you could get us all the logs (a tarball maybe?) from fw1 when fw2 is taken down.
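Something like this should be enough to grab everything - adjust the path if your logs live somewhere else:

tar czf fw1-gluster-logs.tar.gz /var/log/glusterfs/   # run on fw1 while fw2 is down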

Pranith.

Frank Sonntag
2013-07-10 06:08:06 UTC
Permalink
Hi Greg,

Try using the same server on both machines when mounting, instead of mounting off the local gluster server on both.
I've used the same approach as you in the past and got into all kinds of split-brain problems.
The drawback of course is that mounts will fail if the machine you chose is not available at mount time. It's one of my gripes with gluster that you cannot list more than one server in your mount command.
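In other words, point both machines at the same server - e.g. something like this in /etc/fstab on both fw1 and fw2 (using fw1's address here purely as an example):

# same line on both fw1 and fw2
192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0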

Frank
Rejy M Cyriac
2013-07-10 07:59:42 UTC
Permalink
Post by Frank Sonntag
Hi Greg,
Try using the same server on both machines when mounting, instead of mounting off the local gluster server on both.
I've used the same approach as you in the past and got into all kinds of split-brain problems.
The drawback of course is that mounts will fail if the machine you chose is not available at mount time. It's one of my gripes with gluster that you cannot list more than one server in your mount command.
Frank
Would not the mount option 'backupvolfile-server=<secondary server>' help
at mount time, in the case of the primary server not being available?
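Something along these lines, if the client package supports it (option name as I recall it - 'man mount.glusterfs' should confirm the exact spelling for your version):

mount -t glusterfs -o backupvolfile-server=192.168.253.2 192.168.253.1:/firewall-scripts /firewall-scripts

or the /etc/fstab equivalent:

192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev,backupvolfile-server=192.168.253.2 0 0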

- rejy (rmc)
Frank Sonntag
2013-07-10 08:50:56 UTC
Permalink
Post by Rejy M Cyriac
Would not the mount option 'backupvolfile-server=<secondary server>' help
at mount time, in the case of the primary server not being available?
- rejy (rmc)
I am still on 3.2 which does not have that option (as far as I know).
But thanks for bringing this up. Useful to know.
And the OP can make use of it of course.


Frank
Greg Scott
2013-07-10 11:10:26 UTC
Permalink
Brian, I'm not ready to give up just yet.
Brian Candler
2013-07-10 07:35:30 UTC
Permalink
Bummer. Looks like I'm on my own with this one.
I'm afraid this is the problem with gluster: everything works great on
the happy path, but as soon as anything goes wrong, you're stuffed.
There is neither recovery procedure documentation, nor detailed
internals documentation (so you could work out for yourself what is
going on and fix it). In my opinion, gluster is unsupportable in its
current form.

For this reason I have recently stripped gluster out of a production
network. We're back to using simple NFSv4 at the moment. At some point I
will be evaluating ceph and maybe swift.

As you've observed, there's little point in having "resilient" copies of
data if they are not retrievable in error scenarios.

Regards,

Brian.
Joe Julian
2013-07-11 15:36:45 UTC
Permalink
This is unhelpful and trolling. Please refrain from this behavior.
John Mark Walker
2013-07-11 15:44:23 UTC
Permalink
----- Original Message -----
Post by Joe Julian
This is unhelpful and trolling. Please refrain from this behavior.
Er, not so fast :)

If bad things happen to our users, I want to create the kind of environment where they feel like they can discuss it openly. Things happen, and we need to know about it.

The time to crack down would be if someone posts stuff that is off-topic or continuously rants without acknowledging current efforts. Look, he said it didn't work for him and he's evaluating other things - I don't have a problem with that.

I ask only two things in this area:

1. if future versions attempt to solve your specific issues that you acknowledge the effort and give it a try.
2. once you've said your piece, ie. "this doesn't work for me" and you decide to leave for other pastures, please don't continue to post stuff here that reflects on old releases and out-of-date information.

-JM
Joe Julian
2013-07-11 17:03:11 UTC
Permalink
The trolling is the "you're on your own" bit. We have a very helpful community.

The only questions that go unanswered on irc are the ones where the asker left.

On email, I suspect it's when the information provided stumps the readers, or some people (myself included) tend not to respond when the email has no information at all. (Not saying his had no info, just cataloging.)
Greg Scott
2013-07-11 19:01:23 UTC
Permalink
Well, OK. The “on my own” comment came from me after a long time and a lot of work trying to figure this out. I just went back and checked – I posted the original question on 7/8 at 8:18 PM. I asked a follow-up question 7/9 at 12:27 PM, roughly 16 hours later. And then the “On my own” post was on 7/10 at 12:27 AM, or around 28 hours after my original question. I was feeling kind of lonely and, well, on my own at the time. All times are USA Central time. I do work weird hours.

I certainly don’t mean to be a troll and even after all these years, I still don’t know what a troll is. All I know is, I need help with this issue and I appreciate the advice so far. And, frankly, without community help solving or mitigating this problem, I can’t use Gluster for my HA application because the behavior I observed creates 2 single points of failure instead of eliminating a single point of failure with redundancy. Which creates a serious headache and I would think a problem the whole community would want to overcome.

I gave the best info I know how to give, and I did a bunch of work to try to characterize the problem and take my application out of the mix. If I can do more or provide more info, just tell me what I can provide while I have everything sitting here in a testbed.
My challenge now is, this project was supposed to be delivered several days ago. I’ll be out of town tomorrow so I may not be able to get back to it until Saturday. I just don’t feel good delivering this system until I can test it some more and understand what’s going on with the issue I stumbled upon.


- Greg


John Mark Walker
2013-07-11 19:07:27 UTC
Permalink
No worries, Greg. Your posts have been fine - don't worry about it.

Thanks,
JM

Brian Candler
2013-07-11 20:04:23 UTC
Permalink
I believe the trolling comment was aimed at me, although what I actually
said was less polite than "you're on your own" :-)

I take the point that there is an active and helpful gluster support
community, but IMO that is papering over the underlying cracks. Any data
storage system must be built to handle the error cases first and foremost
- performance and features are secondary. What's the point of having any
sort of redundant system which happily copies your data when everything
is working fine, but then gets itself into a twist when there's a problem?

Without the documentation of the algorithms and states, there's no way
to dig yourself out manually either. Gluster used to have a half-decent
wiki in the 2.x days, but that was orphaned, and now basically the
technical internal documentation consists of a couple of blog postings.

You have the source code of course, but for various pluggability and
performance reasons, gluster's source is hard to get one's head around
(I couldn't anyway). And even then, it shouldn't be the norm to have to
read source code when things break in a production environment. The
procedures should be documented - where to look, what you may see in
different scenarios, what actions you should take in those scenarios.

I apologise if my remarks came across as facetious, but to be honest, I
wish someone had pointed out the issues before I had started on the
journey into gluster and back out again. Since making that posting, I've
had a couple of private replies in support of what I said. Maybe it's
considered impolite to say anything negative about a project on that
project's mailing list.

I remain on the list, mostly lurking, to see if one day things improve.
However if someone will set up a gluster-ex-users mailing list, I'll
join that :-)

Regards,

Brian.
John Mark Walker
2013-07-11 20:13:48 UTC
Permalink
There will be room on this bandwagon when you change your mind :)
Joe Julian
2013-07-11 15:42:11 UTC
Permalink
When you first mount your volume, look in the client log and see if it's connecting to both bricks. I suspect it's not and that the failure is related to firewall settings.

Since your related post shows that the server processes are running when you're experiencing this problem, then it must be that the client is unable to make the necessary TCP connection, and is probably unable from the start.
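A quick way to check that is to look up the brick ports and then test the TCP path from the client side - something like this, where 49152 is simply the port your own log showed and volume status is the authoritative place to read it from:

gluster volume status firewall-scripts        # lists the brick ports
nc -zv 192.168.253.2 49152                    # from fw1, against fw2's brick port
grep -E "Connected to|disconnected" /var/log/glusterfs/firewall-scripts.log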
Greg Scott
2013-07-11 16:46:43 UTC
Permalink
Post by Joe Julian
When you first mount your volume, look in the client log and see if it's connecting to both bricks.
I suspect it's not and that the failure is related to firewall settings.
Logs from both nodes below. For this test, first I did "umount /firewall-scripts" from both nodes. Then I did "mount -av" using the default parameters in my fstab file. I did **not** turn on the backupvolfile-server=<secondary server> for this test. And then in another window, I did "tail /var/log/glusterfs/firewall-scripts.log -f" and you can see the spot where I mounted my file system back up again.

Note that everything works as expected when both nodes are online, so this suggests everyone can see everyone else when things are steady-state. Also note that backupvolfile-server=<secondary server> changed the behavior - I documented this in an earlier post.
Post by Joe Julian
...the failure is related to firewall settings.
No way. I’m wide open on the interface I’m using for heartbeat and glusterfs. In my application, I take node fw1 offline by inserting a firewall rule and then getting rid of it a few seconds later. For testing right now, I just insert the rule by hand, look at a bunch of stuff, then get rid of it later. But since you brought it up, I cleaned out all firewall rules before doing and logging the mounts below. Near as I can tell, it looks like everyone can see everyone else. And the logs look the same to my eye as they did before I dropped all (not relevant) firewall rules.
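For the record, this is roughly how I checked that nothing is filtering the gluster traffic - standard Fedora tools, nothing fancy:

iptables -L -n -v            # no DROP/REJECT rules on the heartbeat interface
ss -tlnp | grep gluster      # glusterd and the brick processes are listening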

Log from fw1:

[***@chicago-fw1 ~]#
[***@chicago-fw1 ~]# tail /var/log/glusterfs/firewall-scripts.log -f
[2013-07-11 15:51:54.423508] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-1: Connected to 192.168.253.2:49152, attached to remote volume '/gluster-fw2'.
[2013-07-11 15:51:54.423576] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-11 15:51:54.440124] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
[2013-07-11 15:51:54.440660] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
[2013-07-11 15:51:54.440886] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
[2013-07-11 15:51:54.442235] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
[2013-07-11 15:51:54.443451] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-0
[2013-07-11 16:21:22.729423] I [fuse-bridge.c:4583:fuse_thread_proc] 0-fuse: unmounting /firewall-scripts
[2013-07-11 16:21:22.730976] W [glusterfsd.c:970:cleanup_and_exit] (-->/usr/lib64/libc.so.6(clone+0x6d) [0x7f7a69fee13d] (-->/usr/lib64/libpthread.so.0(+0x33c1607c53) [0x7f7a6a684c53] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7f7a6b372e35]))) 0-: received signum (15), shutting down
[2013-07-11 16:21:22.731040] I [fuse-bridge.c:5212:fini] 0-fuse: Unmounting '/firewall-scripts'.


Blank space - mount -av below.

[2013-07-11 16:39:36.625696] I [glusterfsd.c:1878:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.0beta3 (/usr/sbin/glusterfs --volfile-id=/firewall-scripts --volfile-server=192.168.253.1 /firewall-scripts)
[2013-07-11 16:39:36.640661] I [socket.c:3480:socket_init] 0-glusterfs: SSL support is NOT enabled
[2013-07-11 16:39:36.640800] I [socket.c:3495:socket_init] 0-glusterfs: using system polling thread
[2013-07-11 16:39:36.672416] I [socket.c:3480:socket_init] 0-firewall-scripts-client-1: SSL support is NOT enabled
[2013-07-11 16:39:36.672539] I [socket.c:3495:socket_init] 0-firewall-scripts-client-1: using system polling thread
[2013-07-11 16:39:36.674545] I [socket.c:3480:socket_init] 0-firewall-scripts-client-0: SSL support is NOT enabled
[2013-07-11 16:39:36.674667] I [socket.c:3495:socket_init] 0-firewall-scripts-client-0: using system polling thread
[2013-07-11 16:39:36.675015] I [client.c:2154:notify] 0-firewall-scripts-client-0: parent translators are ready, attempting connect on transport
[2013-07-11 16:39:36.686253] I [client.c:2154:notify] 0-firewall-scripts-client-1: parent translators are ready, attempting connect on transport
Given volfile:
+------------------------------------------------------------------------------+
1: volume firewall-scripts-client-0
2: type protocol/client
3: option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
4: option username de6eacd1-31bc-4bdb-a049-776cd840059e
5: option transport-type tcp
6: option remote-subvolume /gluster-fw1
7: option remote-host 192.168.253.1
8: end-volume
9:
10: volume firewall-scripts-client-1
11: type protocol/client
12: option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
13: option username de6eacd1-31bc-4bdb-a049-776cd840059e
14: option transport-type tcp
15: option remote-subvolume /gluster-fw2
16: option remote-host 192.168.253.2
17: end-volume
18:
19: volume firewall-scripts-replicate-0
20: type cluster/replicate
21: subvolumes firewall-scripts-client-0 firewall-scripts-client-1
22: end-volume
23:
24: volume firewall-scripts-dht
25: type cluster/distribute
26: subvolumes firewall-scripts-replicate-0
27: end-volume
28:
29: volume firewall-scripts-write-behind
30: type performance/write-behind
31: subvolumes firewall-scripts-dht
32: end-volume
33:
34: volume firewall-scripts-read-ahead
35: type performance/read-ahead
36: subvolumes firewall-scripts-write-behind
37: end-volume
38:
39: volume firewall-scripts-io-cache
40: type performance/io-cache
41: subvolumes firewall-scripts-read-ahead
42: end-volume
43:
44: volume firewall-scripts-quick-read
45: type performance/quick-read
46: subvolumes firewall-scripts-io-cache
47: end-volume
48:
49: volume firewall-scripts-open-behind
50: type performance/open-behind
51: subvolumes firewall-scripts-quick-read
52: end-volume
53:
54: volume firewall-scripts-md-cache
55: type performance/md-cache
56: subvolumes firewall-scripts-open-behind
57: end-volume
58:
59: volume firewall-scripts
60: type debug/io-stats
61: option count-fop-hits off
62: option latency-measurement off
63: subvolumes firewall-scripts-md-cache
64: end-volume

+------------------------------------------------------------------------------+
[2013-07-11 16:39:36.698740] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-0: changing port to 49152 (from 0)
[2013-07-11 16:39:36.698974] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-11 16:39:36.711537] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-1: changing port to 49152 (from 0)
[2013-07-11 16:39:36.711717] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-1: readv failed (No data available)
[2013-07-11 16:39:36.723116] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2013-07-11 16:39:36.723521] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2013-07-11 16:39:36.723913] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-0: Connected to 192.168.253.1:49152, attached to remote volume '/gluster-fw1'.
[2013-07-11 16:39:36.723995] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-11 16:39:36.724390] I [afr-common.c:3698:afr_notify] 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' came back up; going online.
[2013-07-11 16:39:36.724601] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-0: Server lk version = 1
[2013-07-11 16:39:36.724730] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-1: Connected to 192.168.253.2:49152, attached to remote volume '/gluster-fw2'.
[2013-07-11 16:39:36.724788] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-11 16:39:36.737359] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
[2013-07-11 16:39:36.739297] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
[2013-07-11 16:39:36.739486] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
[2013-07-11 16:39:36.740672] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
[2013-07-11 16:39:36.741820] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-0

And from fw2:

[***@chicago-fw2 ~]# tail /var/log/glusterfs/firewall-scripts.log -f
[2013-07-11 15:51:45.499012] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-11 15:51:45.512667] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
[2013-07-11 15:51:45.513211] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-0: Server lk version = 1
[2013-07-11 15:51:45.513416] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
[2013-07-11 15:51:45.513538] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
[2013-07-11 15:51:45.515208] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
[2013-07-11 15:51:45.516512] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-1
[2013-07-11 16:21:28.150710] I [fuse-bridge.c:4583:fuse_thread_proc] 0-fuse: unmounting /firewall-scripts
[2013-07-11 16:21:28.154455] W [glusterfsd.c:970:cleanup_and_exit] (-->/usr/lib64/libc.so.6(clone+0x6d) [0x7fa599ad613d] (-->/usr/lib64/libpthread.so.0(+0x3c1b407c53) [0x7fa59a16cc53] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fa59ae5ae35]))) 0-: received signum (15), shutting down
[2013-07-11 16:21:28.154503] I [fuse-bridge.c:5212:fini] 0-fuse: Unmounting '/firewall-scripts'.


Blank space - this is where I did mount -av

[2013-07-11 16:39:35.100584] I [glusterfsd.c:1878:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.0beta3 (/usr/sbin/glusterfs --volfile-id=/firewall-scripts --volfile-server=192.168.253.2 /firewall-scripts)
[2013-07-11 16:39:35.113481] I [socket.c:3480:socket_init] 0-glusterfs: SSL support is NOT enabled
[2013-07-11 16:39:35.113614] I [socket.c:3495:socket_init] 0-glusterfs: using system polling thread
[2013-07-11 16:39:35.147118] I [socket.c:3480:socket_init] 0-firewall-scripts-client-1: SSL support is NOT enabled
[2013-07-11 16:39:35.147313] I [socket.c:3495:socket_init] 0-firewall-scripts-client-1: using system polling thread
[2013-07-11 16:39:35.149112] I [socket.c:3480:socket_init] 0-firewall-scripts-client-0: SSL support is NOT enabled
[2013-07-11 16:39:35.149268] I [socket.c:3495:socket_init] 0-firewall-scripts-client-0: using system polling thread
[2013-07-11 16:39:35.149390] I [client.c:2154:notify] 0-firewall-scripts-client-0: parent translators are ready, attempting connect on transport
[2013-07-11 16:39:35.160491] I [client.c:2154:notify] 0-firewall-scripts-client-1: parent translators are ready, attempting connect on transport
Given volfile:
+------------------------------------------------------------------------------+
1: volume firewall-scripts-client-0
2: type protocol/client
3: option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
4: option username de6eacd1-31bc-4bdb-a049-776cd840059e
5: option transport-type tcp
6: option remote-subvolume /gluster-fw1
7: option remote-host 192.168.253.1
8: end-volume
9:
10: volume firewall-scripts-client-1
11: type protocol/client
12: option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
13: option username de6eacd1-31bc-4bdb-a049-776cd840059e
14: option transport-type tcp
15: option remote-subvolume /gluster-fw2
16: option remote-host 192.168.253.2
17: end-volume
18:
19: volume firewall-scripts-replicate-0
20: type cluster/replicate
21: subvolumes firewall-scripts-client-0 firewall-scripts-client-1
22: end-volume
23:
24: volume firewall-scripts-dht
25: type cluster/distribute
26: subvolumes firewall-scripts-replicate-0
27: end-volume
28:
29: volume firewall-scripts-write-behind
30: type performance/write-behind
31: subvolumes firewall-scripts-dht
32: end-volume
33:
34: volume firewall-scripts-read-ahead
35: type performance/read-ahead
36: subvolumes firewall-scripts-write-behind
37: end-volume
38:
39: volume firewall-scripts-io-cache
40: type performance/io-cache
41: subvolumes firewall-scripts-read-ahead
42: end-volume
43:
44: volume firewall-scripts-quick-read
45: type performance/quick-read
46: subvolumes firewall-scripts-io-cache
47: end-volume
48:
49: volume firewall-scripts-open-behind
50: type performance/open-behind
51: subvolumes firewall-scripts-quick-read
52: end-volume
53:
54: volume firewall-scripts-md-cache
55: type performance/md-cache
56: subvolumes firewall-scripts-open-behind
57: end-volume
58:
59: volume firewall-scripts
60: type debug/io-stats
61: option count-fop-hits off
62: option latency-measurement off
63: subvolumes firewall-scripts-md-cache
64: end-volume

+------------------------------------------------------------------------------+
[2013-07-11 16:39:35.173867] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-0: changing port to 49152 (from 0)
[2013-07-11 16:39:35.174065] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-1: changing port to 49152 (from 0)
[2013-07-11 16:39:35.174377] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-11 16:39:35.185807] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-1: readv failed (No data available)
[2013-07-11 16:39:35.197485] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2013-07-11 16:39:35.197740] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2013-07-11 16:39:35.198257] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-0: Connected to 192.168.253.1:49152, attached to remote volume '/gluster-fw1'.
[2013-07-11 16:39:35.198346] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-11 16:39:35.198546] I [afr-common.c:3698:afr_notify] 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' came back up; going online.
[2013-07-11 16:39:35.198759] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-1: Connected to 192.168.253.2:49152, attached to remote volume '/gluster-fw2'.
[2013-07-11 16:39:35.198810] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-11 16:39:35.211534] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
[2013-07-11 16:39:35.211921] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
[2013-07-11 16:39:35.212098] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-0: Server lk version = 1
[2013-07-11 16:39:35.212234] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
[2013-07-11 16:39:35.213421] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
[2013-07-11 16:39:35.214372] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-1
Joe Julian
2013-07-11 19:47:08 UTC
Permalink
Ok, now I'm intrigued.

btw... when I read your initial email I was on my phone. I only got as far as the SELinux error before my ADHD got the better of me and I thought, "well, it says what the problem is right there." Sorry, otherwise I would have answered at that time.

As it turns out, reading further, that error you're seeing comes from glusterfsd.service (not glusterd.service), which shouldn't even be enabled unless you're trying to use legacy volfiles from 3.0. The "parsing the volfile failed" error was spurious, as you discovered.

As for your current problem...

Are your two machines perhaps connected via crossover cable?

The question comes down to this: when you're on 192.168.253.1 and shut down 192.168.253.2, what prevents .1 from being reached? Is it, perhaps, because it's gone offline? Check dmesg. See if you can ping the .1 address (when .2 is down) and see if you can telnet to port 24007 on .1.
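
For the record, a minimal version of those two checks, run from whichever node is reporting the failure (a sketch, nothing more):

ping -c 3 192.168.253.1        # basic reachability of the .1 address
telnet 192.168.253.1 24007     # can we open a TCP connection to glusterd's management port?

If the ping works but the telnet connection never establishes, that would point at a firewall rule or at glusterd itself rather than basic connectivity.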
Greg Scott
2013-07-11 23:09:04 UTC
Permalink
Post by Joe Julian
Are your two machines perhaps connected via crossover cable?
Well, it's a little more complicated but essentially true. What follows is a bunch of probably-not-relevant detail, but I owe it to you if I want my problem worked on, so here goes.

The NICs are those new smart ones that "know" what kind of cable they see and can reverse the send and receive signals when they need to. So to be completely accurate, I'm using straight-thru cables to connect everything, not crossovers. The actual hardware is a Jetway motherboard with 2 onboard NIC slots, with the daughtercard that has 3 NIC slots. I have the actual model numbers on some paperwork around here someplace. These are all inside a nifty little Mini-ITX case, so I have a box around 6 inches square and maybe 1 1/2 inches thick with 5 NIC slots. I am in love with this hardware, at least for firewalls.

F19 has a new way of naming Ethernet interfaces, so these names will look a little strange. But here are the details of which interfaces go where - they're the same on both systems.

Interface enp2s0 goes to my simulated Internet, for now, just an older Linux system with the default Gateway address. The actual physical path is through an old broken-down Ethernet switch and into my simulated default gateway.
Interface enp3s0 is the LAN side. Right now in my testbed, these are empty on both nodes.
Interface enp5s4 does heartbeat and Gluster. Point to point fw1 <--> fw2. The IP Address on fw1 is 192.168.253.1 and on fw2, ...253.2.
Interface enp5s6 is empty and will be unused in this application.
Interface enp5s7 is for a future DMZ, but really for development. It connects to my Ethernet switch right now and is set up to live in my LAN, but will be empty in production. This comes in handy because I can get to both systems in different windows on my workstation here. I do all my firewalls this way, with an extra NIC for "kind of out of band" debugging.

The hardware and cabling all work just fine. No issues. I **purposely** isolate fw1 from fw2 on interface enp5s4 to reproduce the problem.

I first discovered the problem when booting each node. I have some logic in my bootup that figures out the least assertive and most assertive partner. The least assertive partner takes its heartbeat/gluster interface offline for a few seconds, so the most assertive partner will miss a couple of pings on the heartbeat interface and take control. This worked well for several years when both systems were completely separate and I manually kept up my config files on each node. It also worked well with older versions of Gluster a couple years ago. But now, trying to use the latest and greatest Gluster, my most assertive partner would never take control. Digging into it, I found it could not find its rc.firewall script. Of course, by the time I was done digging through my own application logs, the error condition that set up the problem was long gone and everyone could see everyone again. So all I had was my failover.log with a message saying it couldn't find rc.firewall.

I've had startup issues before, but this felt different. So I came up with an experiment. Node fw1 will always be the least assertive firewall partner, and node fw1 is where I did all the initial Gluster setup. I think this combination will turn out to be relevant in a few sentences.

So the experiment - on node fw1, put in this firewall rule to reject everything from its partner as the very first rule. This isolates fw1 and fw2 from each other, but I can still see both of them from my workstation.

iptables -I INPUT 1 -i enp5s4 -s 192.168.253.2 -j REJECT

And then on node fw2, try to do

ls /firewall-scripts

Sure enough, that failed on fw2. Node fw2 was unable to access /firewall-scripts.

After reproducing the problem, I run this command on fw1 to delete that reject rule so fw1 and fw2 can find each other again:

iptables -D INPUT -i enp5s4 -s 192.168.253.2 -j REJECT

And within a second or so, or as soon as I could whip up an "ls /firewall-scripts" command on fw2, it could now see that directory again. I posted the logs from all that earlier. But all the logs really tell us is, fw1 and fw2 are isolated from each other. Well, duh! They're isolated because I isolated them!

I've also noticed the behavior seems slightly different when I take fw2 offline and try my ls command from fw1. Reading through what I can get my hands on, it seems the first brick is kind of a "master", and fw1 is my first brick. So fw1 is the "important" Gluster partner, but in my application, the least assertive partner. When both nodes boot at the same time, fw1 will always isolate itself from fw2 for a few seconds, which will mess up fw2, and I'll end up with a firewall system in which nobody asserts itself.

So the Gluster behavior broke my startup, although I have some ideas to work around that. More important, this system will be 400 miles from me and it has to be reliable. What happens when one node goes offline in production, and the other node cannot take control because it can't find the directory with all its scripts? Right when I need my carefully scripted automated failover the most, it may break because it can't find the scripts it needs. That kind of stuff is bad for business.

Anyway, now, after several mounts and umounts and different combinations of mount options, I should probably comment out all my startup stuff again, reboot both boxes, and try some even more structured and methodical tests.

Or hope for a shortcut to all that testing if anyone has seen this behavior before and has a way around it.

OK, what behavior would I like to see? Both nodes should try to satisfy reads locally. Why reach across the network when there's a copy right here? If the nodes become isolated from each other, they should still satisfy reads locally, so daemons and other apps running on those nodes can continue to run. Writes? Well - satisfy the write locally for now and keep track of what needs to copy over the network. When the far-end node comes back online, send the changes. And maybe provide an option for how to handle conflicts when everyone updates the same file.
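
For what it's worth, the replicate translator already behaves roughly like this - each brick keeps track of what its partner is missing and heals it when it comes back - and the backlog can be checked from either node with the heal commands. A sketch, assuming the GlusterFS 3.3+ CLI:

gluster volume heal firewall-scripts info    # list entries still waiting to be healed
gluster volume heal firewall-scripts         # kick off a heal now instead of waiting for the next crawl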

I know in my use case, I need that /firewall-scripts Gluster directory to stay available on the surviving node when one node goes offline. I can't have failover scripts not run because they can't find themselves.
Greg Scott
2013-07-11 23:29:42 UTC
Permalink
Oh yes - This question will come up about how I do failover in my application and somebody is bound to remind me that initiating a failover just because heartbeat goes away isn't good enough. All I can say is, yes, I know. I fail over if and only if the active partner does not answer on its heartbeat interface, and does not answer on all Internet, all LAN, and all optional DMZ interfaces, and if the default gateway does answer. I do a failover if and only if all those conditions are true.

But none of this is relevant to my Gluster issue and that's where I need some help.

Thanks

- Greg
Greg Scott
2013-07-11 23:43:53 UTC
Permalink
So back to the problem at hand - I think what's going on is, both nodes fw1 and fw2 try to satisfy reads from fw1 first. That's why fw2 can't find the /firewall-scripts file system when it becomes isolated from fw1, and why fw1 always seems to be able to find it. What makes fw1 so important? Near as I can tell, because fw1 is the first in the list and I used node fw1 to set up my Gluster volume.

So after putting up with me for page after page of text digging into the problem details, is there anything we can do to tell Gluster to satisfy reads locally, especially when the other brick is offline?

Thanks

- Greg
Greg Scott
2013-07-13 15:19:18 UTC
Permalink
I was out all day yesterday - is there anything I can do to fix this problem or is this just pretty much how Gluster works?

- Greg


Joe Julian
2013-07-13 16:22:32 UTC
Permalink
No, they're equal peers. Each client connects to both servers after retrieving the configuration from the server specified in the mount command.

When a server shuts down, the TCP connection is properly closed and the clients continue to operate with the remaining servers. In a replicated volume that means without any missing data.

When the TCP connection is not closed, the client will attempt to reach the missing server for 42 (network.ping-timeout) seconds. The filesystem appears frozen during that timeout. Once timed out, the client should continue as above.

Your logs, however, say that the client has lost connection with ALL the servers. What I've seen in your logs so far doesn't show both disconnects; I've only seen the last one. If you'll follow my instructions, I can get a clearer picture of what's going wrong.

This is one of the reasons I hate mailing lists and do most of my support via IRC. On IRC there aren't these hours- or days-long delays between replies. We're generally able to solve the worst problems in a few hours, so I feel I am making a difference.

Anyway, follow my complete instructions and I'll help you further. I'm sure we can figure this out.
Greg Scott
2013-07-13 21:03:26 UTC
Permalink
➢ Anyway, follow my complete instructions and I'll help you further. I'm sure we can figure this out.

OK, I’m an official dork. What instructions? And now I’m intrigued. How can a node lose connection with itself?
Joe Julian
2013-07-13 21:27:37 UTC
Permalink
Huh.. this was in my sent folder... let's try again.

There's something missing from this picture. The logs show that the client is connecting to both servers, but they only show the disconnection from one and then claim that it's not connected to any bricks after that.

Here's the data I'd like to have you generate:

unmount the clients
gluster volume set firewall-scripts diagnostics.client-log-level DEBUG
gluster volume set firewall-scripts diagnostics.brick-log-level DEBUG
systemctl stop glusterd.service
truncate the client, glusterd, and server logs
systemctl start glusterd
mount /firewall-scripts
Do your iptables disconnect
telnet $this_host_ip 24007 # report whether or not it establishes a connection
ls /firewall-scripts
wait 42 seconds
ls /firewall-scripts
Remove the iptables rule
ls /firewall-scripts
tar up the logs and email them to me.
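
Roughly the same sequence as a copy-and-paste block for fw1. The log file names and the telnet target are assumptions here - adjust them to whatever is actually under /var/log/glusterfs and to the IP of the node you run this on:

umount /firewall-scripts
gluster volume set firewall-scripts diagnostics.client-log-level DEBUG
gluster volume set firewall-scripts diagnostics.brick-log-level DEBUG
systemctl stop glusterd.service
truncate -s 0 /var/log/glusterfs/firewall-scripts.log \
    /var/log/glusterfs/etc-glusterfs-glusterd.vol.log \
    /var/log/glusterfs/bricks/*.log                 # assumed log locations
systemctl start glusterd
mount /firewall-scripts
iptables -I INPUT 1 -i enp5s4 -s 192.168.253.2 -j REJECT    # your isolation rule
telnet 192.168.253.1 24007                          # note whether it connects
ls /firewall-scripts
sleep 42
ls /firewall-scripts
iptables -D INPUT -i enp5s4 -s 192.168.253.2 -j REJECT
ls /firewall-scripts
tar czf gluster-debug-logs.tar.gz /var/log/glusterfs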

You can reset the log-level:

gluster volume reset firewall-scripts diagnostics.client-log-level
gluster volume reset firewall-scripts diagnostics.brick-log-level

lastly, do you have a loopback interface (lo) on 127.0.0.1 and is
localhost defined in /etc/hosts?
Greg Scott
2013-07-13 23:32:45 UTC
Permalink
Ok – starting on it now. On this question:

➢ lastly, do you have a loopback interface (lo) on 127.0.0.1 and is localhost defined in /etc/hosts?

Yes.

[***@chicago-fw1 ~]# ip addr show dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
[***@chicago-fw1 ~]#
[***@chicago-fw1 ~]# more /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
[***@chicago-fw1 ~]#

And

[***@chicago-fw2 ~]# ip addr show dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
[***@chicago-fw2 ~]#
[***@chicago-fw2 ~]# more /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
[***@chicago-fw2 ~]#

- Greg

Greg Scott
2013-07-13 23:58:15 UTC
Permalink
Log files sent privately to Joe. If others from the community want to look at them, I’m OK with posting them here. I don’t think they have anything confidential. Now that I know about that 42-second timeout, the behavior makes more sense. Why 42? What’s special about 42? Is there a way I can adjust that down for my application to, say, 1 or 2 seconds?


- Greg

Greg Scott
2013-07-14 00:00:54 UTC
Permalink
Oh yes – telnet localhost 24007 was successful on both nodes even after isolating fw1.


- Greg

Joe Julian
2013-07-14 00:37:36 UTC
Permalink
These logs show different results. The results you reported and pasted
earlier included, "[2013-07-09 00:59:04.706390] I
[afr-common.c:3856:afr_local_init] 0-firewall-scripts-replicate-0: no
subvolumes up", which would produce the "Transport endpoint not
connected" error you reported at first. These results look normal and
should have produced the behavior I described.

42 is The Answer to Life, The Universe, and Everything.

Re-establishing FDs and locks is an expensive operation. The
ping-timeout is long because it should not happen, but if there is
temporary network congestion you'd (normally) rather have your volume
remain up and pause than have to re-establish everything. Typically,
unless you expect your servers to crash often, leaving ping-timeout at
the default is best. YMMV and it's configurable in case you know what
you're doing and why.
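
If you do decide to shorten it for this setup, it's just a volume option - a sketch, with the value picked only as an example, not a recommendation:

gluster volume set firewall-scripts network.ping-timeout 10     # seconds
gluster volume info firewall-scripts                            # confirm it shows up under Options Reconfigured
gluster volume reset firewall-scripts network.ping-timeout      # back to the 42-second default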
Greg Scott
2013-07-14 02:13:23 UTC
Permalink
Hmmm – I wonder what’s different now when it behaves as expected versus before when it behaved badly?

Well – by now both systems have been up and running in my testbed for several days. I’ve unmounted and mounted the volumes a bunch of times. But thinking back – the behavior changed when I mounted the volume on each node with the other node as the backupvolfile-server.

On fw1:
mount -t glusterfs -o backupvolfile-server=192.168.253.2 192.168.253.1:/firewall-scripts /firewall-scripts

And on fw2:
mount -t glusterfs -o backupvolfile-server=192.168.253.1 192.168.253.2:/firewall-scripts /firewall-scripts

Since then, I’ve stopped and restarted glusterd and unmounted and mounted the volumes again as set up in fstab, without the backupvolfile-server option. But maybe that backupvolfile-server switch set some parameter permanently.
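
If it really is the backupvolfile-server option that matters, one way to make it survive reboots would be to put it straight into fstab instead of on the mount command line - a sketch for fw1, untested here, with the mirror-image entry on fw2:

192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev,backupvolfile-server=192.168.253.2 0 0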

Here is the rc.local I set up in each node. I wonder if some kind of timing thing is going on? Or if -o backupvolfile-server=(the other node) permanently cleared a glitch from the initial setup? I guess I could try some reboots and see what happens.

#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
#
# Note removed by default starting in Fedora 16.

touch /var/lock/subsys/local

#***********************************
# Local stuff below

echo "Making sure the Gluster stuff is mounted"
mount -av
# The fstab mounts happen early in startup, then Gluster starts up later.
# By now, Gluster should be up and running and the mounts should work.
# That _netdev option is supposed to account for the delay but doesn't seem
# to work right.

echo "Starting up firewall common items"
/firewall-scripts/etc/rc.d/common-rc.local
[***@chicago-fw1 log]#
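
One crude way to guard against that timing problem would be for rc.local to retry the mount for a bounded time before touching anything on the volume - a minimal sketch, assuming the fstab entries shown below and the standard mountpoint utility:

# Wait up to about a minute for the gluster mount before giving up
for i in $(seq 1 12); do
    mountpoint -q /firewall-scripts && break
    mount /firewall-scripts 2>/dev/null
    sleep 5
done
mountpoint -q /firewall-scripts || echo "WARNING: /firewall-scripts is still not mounted"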

Here is what fstab looks like on each node.

From fw1:

[***@chicago-fw1 log]# more /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sat Jul 6 04:26:01 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/fedora-root / ext4 defaults 1 1
UUID=818c4142-e389-4f28-a28e-6e26df3caa32 /boot ext4 defaults 1 2
UUID=C57B-BCF9 /boot/efi vfat umask=0077,shortname=winnt 0 0
/dev/mapper/fedora-gluster--fw1 /gluster-fw1 xfs defaults 1 2
/dev/mapper/fedora-swap swap swap defaults 0 0
# Added gluster stuff Greg Scott
192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

[***@chicago-fw1 log]#

And fw2:

[***@chicago-fw2 log]# more /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sat Jul 6 05:08:55 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/fedora-root / ext4 defaults 1 1
UUID=f0cceb6a-61c4-409b-b882-5d6779a52505 /boot ext4 defaults 1 2
UUID=665D-DF0B /boot/efi vfat umask=0077,shortname=winnt 0 0
/dev/mapper/fedora-gluster--fw2 /gluster-fw2 ext4 defaults 1 2
/dev/mapper/fedora-swap swap swap defaults 0 0
# Added gluster stuff Greg Scott
192.168.253.2:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

[***@chicago-fw2 log]#

- Greg

From: Joe Julian [mailto:***@julianfamily.org]
Sent: Saturday, July 13, 2013 7:38 PM
To: Greg Scott
Cc: 'gluster-***@gluster.org'
Subject: Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

These logs show different results. The results you reported and pasted earlier included, "[2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 0-firewall-scripts-replicate-0: no subvolumes up", which would produce the "Transport endpoint not connected" error you reported at first. These results look normal and should have produced the behavior I described.

42 is The Answer to Life, The Universe, and Everything.

Re-establishing FDs and locks is an expensive operation. The ping-timeout is long because it should not happen, but if there is temporary network congestion you'd (normally) rather have your volume remain up and pause than have to re-establish everything. Typically, unless you expect your servers to crash often, leaving ping-timeout at the default is best. YMMV and it's configurable in c
Greg Scott
2013-07-14 02:23:35 UTC
Permalink
I have a thought brewing in my head - how does Gluster "know" the other node is down? Is it really ICMP pings? Or is there some kind of heartbeat dialog on TCP port 24007? Here is where I'm going with this. My application uses old-fashioned ICMP pings. When I purposely isolate fw1 and fw2, I used to just do ifdown $HBEAT_IFACE on my least assertive partner for a few seconds at startup time. I modified it to use the iptables rule I documented before because of the Gluster troubles and I figured downing the whole interface may have been a bit radical since Gluster depends on it. So I got a little finer grained and put in that iptables rule instead. But I could get even finer grained and just as easily only block ICMP - or even finer, just block ICMP echo request - and that should satisfy my application and leave Gluster alone.

Then the testing would switch to testing what happens when the other node being down really is an exception condition. Does this make sense?
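
A sketch of that finer-grained rule - untested here - blocking only ping from the partner so the heartbeat check trips while the Gluster TCP connections on the same interface stay up:

iptables -I INPUT 1 -i enp5s4 -s 192.168.253.2 -p icmp --icmp-type echo-request -j DROP
# ...and to put things back afterwards:
iptables -D INPUT -i enp5s4 -s 192.168.253.2 -p icmp --icmp-type echo-request -j DROP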
Greg Scott
2013-07-15 19:22:46 UTC
Permalink
Well, none of my ideas worked. I see that Gluster is up to the real 3.4.0 now. No more beta. So after a yum update and reboot of both fw1 and fw2, I decided to focus only on mounting my /firewall-scripts volume at startup time. Forget about my application and taking a node offline and testing, let's just get the volume mounted properly first at startup time. Cover the basics first.

I have an rc.local that mounts my filesystem and then runs a common script that lives inside that file system. That line is commented out but systemd is apparently trying to execute it anyway. Here is what /etc/rc.d/rc.local currently looks like, followed by an extract from /var/log/messages showing what actually happens. Warning, it's ugly. Viewer discretion is advised.

#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
#
# Note removed by default starting in Fedora 16.

touch /var/lock/subsys/local

#***********************************
# Local stuff below

echo "Making sure the Gluster stuff is mounted"
echo "Mounted before mount -av"
df -h
mount -av
echo "Mounted after mount -av"
df -h
# The fstab mounts happen early in startup, then Gluster starts up later.
# By now, Gluster should be up and running and the mounts should work.
# That _netdev option is supposed to account for the delay but doesn't seem
# to work right.

echo "Starting up firewall common items"
##/firewall-scripts/etc/rc.d/common-rc.local

[***@chicago-fw2 rc.d]#

And here is the extract from /var/log/messages on fw1 showing what actually happens. The log on fw2 is similar.

Jul 15 13:49:59 chicago-fw1 audispd: queue is full - dropping event
Jul 15 13:49:59 chicago-fw1 audispd: queue is full - dropping event
Jul 15 13:49:59 chicago-fw1 audispd: queue is full - dropping event
Jul 15 13:50:00 chicago-fw1 setroubleshoot: SELinux is preventing /usr/sbin/glusterfsd from 'read, write' accesses on the chr_file fuse. For complete SELinux messages. run sealert -l ff532d9a-f5$
Jul 15 13:50:01 chicago-fw1 systemd[1]: Started GlusterFS an clustered file-system server.
Jul 15 13:50:01 chicago-fw1 systemd[1]: Starting GlusterFS an clustered file-system server...
Jul 15 13:50:01 chicago-fw1 glusterfsd[1255]: [2013-07-15 18:50:01.409064] C [glusterfsd.c:1374:parse_cmdline] 0-glusterfs: ERROR: parsing the volfile failed (No such file or directory)
Jul 15 13:50:01 chicago-fw1 glusterfsd[1255]: USAGE: /usr/sbin/glusterfsd [options] [mountpoint]
Jul 15 13:50:01 chicago-fw1 GlusterFS[1255]: [2013-07-15 18:50:01.409064] C [glusterfsd.c:1374:parse_cmdline] 0-glusterfs: ERROR: parsing the volfile failed (No such file or directory)
Jul 15 13:50:01 chicago-fw1 systemd[1]: glusterfsd.service: control process exited, code=exited status=255
Jul 15 13:50:01 chicago-fw1 systemd[1]: Failed to start GlusterFS an clustered file-system server.
Jul 15 13:50:01 chicago-fw1 systemd[1]: Unit glusterfsd.service entered failed state.
Jul 15 13:50:04 chicago-fw1 mount[1002]: Mount failed. Please check the log file for more details.
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: Mount failed. Please check the log file for more details.
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: / : ignored
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: /boot : already mounted
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: /boot/efi : already mounted
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: /gluster-fw1 : already mounted
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: swap : ignored
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: /firewall-scripts : successfully mounted
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: Mounted after mount -av
Jul 15 13:50:04 chicago-fw1 systemd[1]: firewall\x2dscripts.mount mount process exited, code=exited status=1
Jul 15 13:50:04 chicago-fw1 systemd[1]: Unit firewall\x2dscripts.mount entered failed state.
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: Filesystem Size Used Avail Use% Mounted on
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: /dev/mapper/fedora-root 14G 3.8G 8.7G 31% /
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: devtmpfs 990M 0 990M 0% /dev
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /dev/shm
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: tmpfs 996M 872K 996M 1% /run
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /sys/fs/cgroup
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /tmp
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: /dev/sda2 477M 87M 365M 20% /boot
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: /dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
Jul 15 13:50:04 chicago-fw1 rc.local[1006]: /etc/rc.d/rc.local: line 26: /firewall-scripts/etc/rc.d/common-rc.local: No such file or directory
Jul 15 13:50:04 chicago-fw1 systemd[1]: rc-local.service: control process exited, code=exited status=127
Jul 15 13:50:04 chicago-fw1 systemd[1]: Failed to start /etc/rc.d/rc.local Compatibility.
Jul 15 13:50:04 chicago-fw1 systemd[1]: Unit rc-local.service entered failed state.
Jul 15 13:50:04 chicago-fw1 systemd[1]: Starting Terminate Plymouth Boot Screen...
Jul 15 13:50:04 chicago-fw1 systemd[1]: Starting Wait for Plymouth Boot Screen to Quit...
Greg Scott
2013-07-15 19:28:55 UTC
Permalink
Maybe I am dealing with a systemd timing glitch because I can do my mount by hand on both nodes.

I do

ls /firewall-scripts, confirm it's empty, then

mount -av, and then another

ls /firewall-scripts and now my files show up. Both nodes behave identically.

[***@chicago-fw2 rc.d]# nano /var/log/messages
[***@chicago-fw2 rc.d]# ls /firewall-scripts
[***@chicago-fw2 rc.d]# mount -av
/ : ignored
/boot : already mounted
/boot/efi : already mounted
/gluster-fw2 : already mounted
swap : ignored
extra arguments at end (ignored)
/firewall-scripts : successfully mounted
[***@chicago-fw2 rc.d]# ls /firewall-scripts
allow-all failover-monitor.sh lost+found route-monitor.sh
allow-all-with-nat fwdate.txt rc.firewall start-failover-monitor.sh
etc initial_rc.firewall rcfirewall.conf var
[***@chicago-fw2 rc.d]#

- Greg
Greg Scott
2013-07-15 20:19:10 UTC
Permalink
Woops, didn't copy the list on this one.
*****

I have SElinux set to permissive mode so those SELinux warnings should not be important. If they were real, I would also have trouble mounting by hand, right?
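
Just to double-check, a quick sanity test on both nodes would be something like:

getenforce
# expect it to print Permissive (sestatus gives more detail)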

- Greg

-----Original Message-----
From: Joe Julian [mailto:***@julianfamily.org]
Sent: Monday, July 15, 2013 2:37 PM
To: Greg Scott
Subject: Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

It's a known selinux bug: https://bugzilla.redhat.com/show_bug.cgi?id=984465

Either add your own policy via audit2allow or wait for a fix. (I'd do the former.)
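
Roughly something like this, as a sketch (the module name is just an example):

grep glusterfsd /var/log/audit/audit.log | audit2allow -M glusterfsdlocal
semodule -i glusterfsdlocal.pp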
Post by Greg Scott
Maybe I am dealing with a systemd timing glitch because I can do my mount by hand on both nodes.
I do
ls /firewall-scripts, confirm it's empty, then
mount -av, and then another
ls /firewall-scripts and now my files show up. Both nodes behave identically.
/ : ignored
/boot : already mounted
/boot/efi : already mounted
/gluster-fw2 : already mounted
swap : ignored
extra arguments at end (ignored)
/firewall-scripts : successfully mounted
allow-all failover-monitor.sh lost+found route-monitor.sh
allow-all-with-nat fwdate.txt rc.firewall start-failover-monitor.sh
etc initial_rc.firewall rcfirewall.conf var
- Greg
Greg Scott
2013-07-15 20:29:22 UTC
Permalink
Re: Joe
systemctl disable glusterfsd.service
systemctl enable glusterd.service
Tried this on both nodes and rebooted. Life in the Twilight Zone. First fw1 immediately after logging back in:

[***@chicago-fw1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.8G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 892K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
192.168.253.1:/firewall-scripts 7.6G 19M 7.2G 1% /firewall-scripts
[***@chicago-fw1 ~]#
[***@chicago-fw1 ~]# ls /firewall-scripts
allow-all failover-monitor.sh lost+found route-monitor.sh
allow-all-with-nat fwdate.txt rc.firewall start-failover-monitor.sh
etc initial_rc.firewall rcfirewall.conf var
[***@chicago-fw1 ~]#

But it's not mounted on fw2.

[***@chicago-fw2 rc.d]# reboot
login as: root
***@10.10.10.72's password:
Last login: Mon Jul 15 13:53:40 2013 from tinahp100b.infrasupport.local
[***@chicago-fw2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 4.1G 8.4G 33% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 892K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 90M 362M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw2 7.6G 19M 7.2G 1% /gluster-fw2
[***@chicago-fw2 ~]#

Here is an extract from /var/log/messages on fw2.

.
.
.
Jul 15 15:18:26 chicago-fw2 audispd: queue is full - dropping event
Jul 15 15:18:26 chicago-fw2 audispd: queue is full - dropping event
Jul 15 15:18:28 chicago-fw2 systemd[1]: Started GlusterFS an clustered file-system server.
Jul 15 15:18:28 chicago-fw2 systemd[1]: Starting GlusterFS an clustered file-system server...
Jul 15 15:18:28 chicago-fw2 glusterfsd[1220]: [2013-07-15 20:18:28.304028] C [glusterfsd.c:1374:parse_cmdline] 0-glusterfs: ERROR: parsing the volfile failed (No such file or directory)
Jul 15 15:18:28 chicago-fw2 glusterfsd[1220]: USAGE: /usr/sbin/glusterfsd [options] [mountpoint]
Jul 15 15:18:28 chicago-fw2 GlusterFS[1220]: [2013-07-15 20:18:28.304028] C [glusterfsd.c:1374:parse_cmdline] 0-glusterfs: ERROR: parsing the volfile failed (No such file or directory)
Jul 15 15:18:28 chicago-fw2 systemd[1]: glusterfsd.service: control process exited, code=exited status=255
Jul 15 15:18:28 chicago-fw2 systemd[1]: Failed to start GlusterFS an clustered file-system server.
Jul 15 15:18:28 chicago-fw2 systemd[1]: Unit glusterfsd.service entered failed state.
Jul 15 15:18:28 chicago-fw2 mount[997]: Mount failed. Please check the log file for more details.
Jul 15 15:18:28 chicago-fw2 rpc.statd[1258]: Version 1.2.7 starting
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: Mount failed. Please check the log file for more details.
Jul 15 15:18:28 chicago-fw2 systemd[1]: firewall\x2dscripts.mount mount process exited, code=exited status=1
Jul 15 15:18:28 chicago-fw2 systemd[1]: Unit firewall\x2dscripts.mount entered failed state.
Jul 15 15:18:28 chicago-fw2 sm-notify[1259]: Version 1.2.7 starting
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: / : ignored
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: /boot : already mounted
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: /boot/efi : already mounted
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: /gluster-fw2 : already mounted
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: swap : ignored
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: /firewall-scripts : successfully mounted
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: Mounted after mount -av
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: Filesystem Size Used Avail Use% Mounted on
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: /dev/mapper/fedora-root 14G 4.1G 8.4G 33% /
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: devtmpfs 990M 0 990M 0% /dev
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: tmpfs 996M 0 996M 0% /dev/shm
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: tmpfs 996M 880K 996M 1% /run
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: tmpfs 996M 0 996M 0% /sys/fs/cgroup
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: tmpfs 996M 4.0K 996M 1% /tmp
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: /dev/sda2 477M 90M 362M 20% /boot
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: /dev/mapper/fedora-gluster--fw2 7.6G 19M 7.2G 1% /gluster-fw2
Jul 15 15:18:28 chicago-fw2 rc.local[1001]: Starting up firewall common items
Jul 15 15:18:28 chicago-fw2 systemd[1]: Started /etc/rc.d/rc.local Compatibility.
Jul 15 15:18:28 chicago-fw2 systemd[1]: Starting Terminate Plymouth Boot Screen...
Jul 15 15:18:28 chicago-fw2 systemd[1]: Starting Wait for Plymouth Boot Screen to Quit...
Jul 15 15:18:28 chicago-fw2 systemd[1]: Started Terminate Plymouth Boot Screen.
Jul 15 15:18:28 chicago-fw2 systemd[1]: Started Wait for Plymouth Boot Screen to Quit.
.
.
.

And the extract from /var/log/messages from fw1

.
.
.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting OpenSSH server daemon...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting /etc/rc.d/rc.local Compatibility...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Started Vsftpd ftp daemon.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Started RPC bind service.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting GlusterFS an clustered file-system server...
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: Making sure the Gluster stuff is mounted
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: Mounted before mount -av
Jul 15 15:18:07 chicago-fw1 systemd[1]: Started OpenSSH server daemon.
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: Filesystem Size Used Avail Use% Mounted on
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: /dev/mapper/fedora-root 14G 3.8G 8.7G 31% /
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: devtmpfs 990M 0 990M 0% /dev
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /dev/shm
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: tmpfs 996M 2.1M 994M 1% /run
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /sys/fs/cgroup
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /tmp
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: /dev/sda2 477M 87M 365M 20% /boot
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: /dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: extra arguments at end (ignored)
Jul 15 15:18:07 chicago-fw1 dbus-daemon[457]: dbus[457]: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)
Jul 15 15:18:07 chicago-fw1 dbus[457]: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)
Jul 15 15:18:07 chicago-fw1 kernel: [ 24.022605] fuse init (API version 7.21)
Jul 15 15:18:07 chicago-fw1 systemd[1]: Mounted /firewall-scripts.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting Remote File Systems.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Reached target Remote File Systems.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Mounting FUSE Control File System...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Mounted FUSE Control File System.
Jul 15 15:18:09 chicago-fw1 systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
Jul 15 15:18:09 chicago-fw1 systemd[1]: Starting Permit User Sessions...
Jul 15 15:18:09 chicago-fw1 systemd[1]: Started Permit User Sessions.
Jul 15 15:18:09 chicago-fw1 systemd[1]: Starting Command Scheduler...
Jul 15 15:18:09 chicago-fw1 systemd[1]: Started Command Scheduler.
Jul 15 15:18:09 chicago-fw1 systemd[1]: Starting Job spooling tools...
Jul 15 15:18:09 chicago-fw1 systemd[1]: Started Job spooling tools.
.
.
.
Greg Scott
2013-07-15 20:37:00 UTC
Permalink
Hang on a second, I didn't show enough of /var/log/messages from fw1. Here is a larger extract, showing the result of the mount commands. It shows errors mounting my filesystem, but then when I first log in, I can see it anyway. Welcome to the Twilight Zone.

I have no idea why libvirtd is starting here. I'm going to get rid of that and reboot again.
.
.
.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Started LSB: Bring up/down networking.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting Network.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Reached target Network.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Started Login and scanning of iSCSI devices.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting Virtualization daemon...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Started Virtualization daemon.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Mounting /firewall-scripts...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting Vsftpd ftp daemon...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting RPC bind service...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting OpenSSH server daemon...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting /etc/rc.d/rc.local Compatibility...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Started Vsftpd ftp daemon.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Started RPC bind service.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting GlusterFS an clustered file-system server...
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: Making sure the Gluster stuff is mounted
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: Mounted before mount -av
Jul 15 15:18:07 chicago-fw1 systemd[1]: Started OpenSSH server daemon.
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: Filesystem Size Used Avail Use% Mounted on
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: /dev/mapper/fedora-root 14G 3.8G 8.7G 31% /
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: devtmpfs 990M 0 990M 0% /dev
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /dev/shm
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: tmpfs 996M 2.1M 994M 1% /run
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /sys/fs/cgroup
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /tmp
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: /dev/sda2 477M 87M 365M 20% /boot
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: /dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
Jul 15 15:18:07 chicago-fw1 rc.local[1006]: extra arguments at end (ignored)
Jul 15 15:18:07 chicago-fw1 dbus-daemon[457]: dbus[457]: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)
Jul 15 15:18:07 chicago-fw1 dbus[457]: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)
Jul 15 15:18:07 chicago-fw1 kernel: [ 24.022605] fuse init (API version 7.21)
Jul 15 15:18:07 chicago-fw1 systemd[1]: Mounted /firewall-scripts.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting Remote File Systems.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Reached target Remote File Systems.
Jul 15 15:18:07 chicago-fw1 systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Mounting FUSE Control File System...
Jul 15 15:18:07 chicago-fw1 systemd[1]: Mounted FUSE Control File System.
Jul 15 15:18:09 chicago-fw1 systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
Jul 15 15:18:09 chicago-fw1 systemd[1]: Starting Permit User Sessions...
Jul 15 15:18:09 chicago-fw1 systemd[1]: Started Permit User Sessions.
Jul 15 15:18:09 chicago-fw1 systemd[1]: Starting Command Scheduler...
Jul 15 15:18:09 chicago-fw1 systemd[1]: Started Command Scheduler.
Jul 15 15:18:09 chicago-fw1 systemd[1]: Starting Job spooling tools...
Jul 15 15:18:09 chicago-fw1 systemd[1]: Started Job spooling tools.
Jul 15 15:18:09 chicago-fw1 libvirtd[1001]: libvirt version: 1.0.5.2, package: 1.fc19 (Fedora Project, 2013-06-12-22:00:57, buildvm-12.phx2.fedoraproject.org)
Jul 15 15:18:09 chicago-fw1 libvirtd[1001]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_xen.so not accessible
Jul 15 15:18:09 chicago-fw1 libvirtd[1001]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_libxl.so not accessible
Jul 15 15:18:09 chicago-fw1 libvirtd[1001]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so not accessible
Jul 15 15:18:09 chicago-fw1 libvirtd[1001]: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_uml.so not accessible
Jul 15 15:18:09 chicago-fw1 avahi-daemon[445]: Registering new address record for fe80::230:18ff:fea2:a340 on enp5s7.*.
Jul 15 15:18:10 chicago-fw1 libvirtd[1001]: open("/var/run/libvirt/network/nwfilter.ltmp"): No such file or directory
Jul 15 15:18:10 chicago-fw1 dbus[457]: [system] Successfully activated service 'org.fedoraproject.Setroubleshootd'
Jul 15 15:18:10 chicago-fw1 dbus-daemon[457]: dbus[457]: [system] Successfully activated service 'org.fedoraproject.Setroubleshootd'
Jul 15 15:18:10 chicago-fw1 kernel: [ 27.494142] Ebtables v2.0 registered
Jul 15 15:18:11 chicago-fw1 kernel: [ 27.575907] ip6_tables: (C) 2000-2006 Netfilter Core Team
Jul 15 15:18:13 chicago-fw1 setroubleshoot: SELinux is preventing /usr/sbin/glusterfsd from name_bind access on the tcp_socket . For complete SELinux messages. run sealert -l 7a4fcd5d-209a-4206-ad06-93e6de5e4327
Jul 15 15:18:14 chicago-fw1 audispd: queue is full - dropping event
Jul 15 15:18:14 chicago-fw1 audispd: queue is full - dropping event
Jul 15 15:18:14 chicago-fw1 audispd: queue is full - dropping event
Jul 15 15:18:14 chicago-fw1 audispd: queue is full - dropping event
Jul 15 15:18:14 chicago-fw1 audispd: queue is full - dropping event
Jul 15 15:18:14 chicago-fw1 audispd: queue is full - dropping event
Jul 15 15:18:14 chicago-fw1 audispd: queue is full - dropping event
.
. skipped a zillion of these
.
Jul 15 15:18:15 chicago-fw1 audispd: queue is full - dropping event
Jul 15 15:18:15 chicago-fw1 audispd: queue is full - dropping event
Jul 15 15:18:17 chicago-fw1 systemd[1]: Started GlusterFS an clustered file-system server.
Jul 15 15:18:17 chicago-fw1 systemd[1]: Starting GlusterFS an clustered file-system server...
Jul 15 15:18:17 chicago-fw1 glusterfsd[1264]: [2013-07-15 20:18:17.309107] C [glusterfsd.c:1374:parse_cmdline] 0-glusterfs: ERROR: parsing the volfile failed (No such file or directory)
Jul 15 15:18:17 chicago-fw1 GlusterFS[1264]: [2013-07-15 20:18:17.309107] C [glusterfsd.c:1374:parse_cmdline] 0-glusterfs: ERROR: parsing the volfile failed (No such file or directory)
Jul 15 15:18:17 chicago-fw1 glusterfsd[1264]: USAGE: /usr/sbin/glusterfsd [options] [mountpoint]
Jul 15 15:18:17 chicago-fw1 systemd[1]: glusterfsd.service: control process exited, code=exited status=255
Jul 15 15:18:17 chicago-fw1 systemd[1]: Failed to start GlusterFS an clustered file-system server.
Jul 15 15:18:17 chicago-fw1 systemd[1]: Unit glusterfsd.service entered failed state.
Jul 15 15:18:20 chicago-fw1 rc.local[1006]: Mount failed. Please check the log file for more details.
Jul 15 15:18:20 chicago-fw1 mount[1002]: Mount failed. Please check the log file for more details.
Jul 15 15:18:20 chicago-fw1 rc.local[1006]: / : ignored
Jul 15 15:18:20 chicago-fw1 rc.local[1006]: /boot : already mounted
Jul 15 15:18:20 chicago-fw1 rc.local[1006]: /boot/efi : already mounted
Jul 15 15:18:20 chicago-fw1 rc.local[1006]: /gluster-fw1 : already mounted
Jul 15 15:18:20 chicago-fw1 rc.local[1006]: swap : ignored
Jul 15 15:18:20 chicago-fw1 rc.local[1006]: /firewall-scripts : successfully mounted
Jul 15 15:18:20 chicago-fw1 rc.local[1006]: Mounted after mount -av
Jul 15 15:18:20 chicago-fw1 systemd[1]: firewall\x2dscripts.mount mount process exited, code=exited status=1
Jul 15 15:18:20 chicago-fw1 sedispatch: AVC Message for setroubleshoot, dropping message
Jul 15 15:18:20 chicago-fw1 sedispatch: AVC Message for setroubleshoot, dropping message
Jul 15 15:18:20 chicago-fw1 sedispatch: AVC Message for setroubleshoot, dropping message
Jul 15 15:18:20 chicago-fw1 sedispatch: AVC Message for setroubleshoot, dropping message
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: df: '/firewall-scripts': Transport endpoint is not connected
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: Filesystem Size Used Avail Use% Mounted on
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: /dev/mapper/fedora-root 14G 3.8G 8.7G 31% /
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: devtmpfs 990M 0 990M 0% /dev
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /dev/shm
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: tmpfs 996M 872K 996M 1% /run
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /sys/fs/cgroup
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: tmpfs 996M 0 996M 0% /tmp
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: /dev/sda2 477M 87M 365M 20% /boot
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: /dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
Jul 15 15:18:21 chicago-fw1 rc.local[1006]: Starting up firewall common items
Jul 15 15:18:21 chicago-fw1 systemd[1]: Started /etc/rc.d/rc.local Compatibility.
Jul 15 15:18:21 chicago-fw1 systemd[1]: Starting Terminate Plymouth Boot Screen...
Jul 15 15:18:21 chicago-fw1 systemd[1]: Starting Wait for Plymouth Boot Screen to Quit...
Jul 15 15:18:21 chicago-fw1 systemd[1]: Started Terminate Plymouth Boot Screen.
Jul 15 15:18:21 chicago-fw1 systemd[1]: Started Wait for Plymouth Boot Screen to Quit.
Jul 15 15:18:21 chicago-fw1 systemd[1]: Starting Getty on tty1...
Jul 15 15:18:21 chicago-fw1 systemd[1]: Started Getty on tty1.

-----Original Message-----
Greg Scott
2013-07-15 20:44:10 UTC
Permalink
This time, after rebooting both nodes, neither one shows /firewall-scripts mounted after a login. But mount -av by hand succeeds on both nodes, and fw1 and fw2 behave identically. Here is what fw1 looks like; fw2 is the same. This aspect of the problem screams timing glitch.

login as: root
***@10.10.10.71's password:
Last login: Mon Jul 15 15:19:41 2013 from tinahp100b.infrasupport.local
[***@chicago-fw1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
[***@chicago-fw1 ~]# mount -av
/ : ignored
/boot : already mounted
/boot/efi : already mounted
/gluster-fw1 : already mounted
swap : ignored
extra arguments at end (ignored)
/firewall-scripts : successfully mounted
[***@chicago-fw1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
192.168.253.1:/firewall-scripts 7.6G 19M 7.2G 1% /firewall-scripts
[***@chicago-fw1 ~]#

- Greg
Ben Turner
2013-07-15 21:32:00 UTC
Permalink
Hi Greg. I don't know if this is the thread I replied to before, but it still sounds to me like your NICs aren't fully up when the gluster mount is attempted. The _netdev handling (at least the version in RHEL 6; I haven't looked at others) doesn't check whether the NIC is fully up, it only checks whether the network manager lock file exists. When I saw this happen in my tests, the lock file existed but the NIC was still initializing and unable to send/receive the traffic needed to mount the FS. I was able to put a sleep in the initscript to work around this:

# diff -pruN /etc/rc.d/init.d/netfs /tmp/initrd/netfs
--- /etc/rc.d/init.d/netfs 2013-04-26 14:32:28.759283055 -0400
+++ /tmp/initrd/netfs 2013-04-26 14:31:38.320059175 -0400
@@ -32,8 +32,6 @@ NETDEVMTAB=$(LC_ALL=C awk '$4 ~ /_netdev
# See how we were called.
case "$1" in
start)
- echo "Sleeping 30 seconds for NW init workaround -benT"
- sleep 30
[ ! -f /var/lock/subsys/network ] && ! nm-online -x >/dev/null 2>&1 && exit 0
[ "$EUID" != "0" ] && exit 4
[ -n "$NFSFSTAB" ] &&

I just used the sleep for testing; the preferred way of dealing with this is probably the LINKDELAY option in your /etc/sysconfig/network-scripts/ifcfg-* file. That variable causes the network scripts to delay for $LINKDELAY seconds. Can you try testing with either of those options to see if you are able to mount at boot?
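
For example, something along these lines in the ifcfg file for the interface Gluster uses (interface name and delay value are just examples):

# /etc/sysconfig/network-scripts/ifcfg-eth0
LINKDELAY=10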

-b

----- Original Message -----
Sent: Monday, July 15, 2013 4:44:10 PM
Subject: Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore
This time after rebooting both nodes, neither one shows /firewall-scripts
mounted after a login. But mount -av by hand is successful on both nodes.
Fw1 and fw2 both behave identically. Here is what fw1 looks like. Fw2 is
identical. This aspect of the problem is screaming timing glitch.
login as: root
Last login: Mon Jul 15 15:19:41 2013 from tinahp100b.infrasupport.local
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
/ : ignored
/boot : already mounted
/boot/efi : already mounted
/gluster-fw1 : already mounted
swap : ignored
extra arguments at end (ignored)
/firewall-scripts : successfully mounted
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
192.168.253.1:/firewall-scripts 7.6G 19M 7.2G 1% /firewall-scripts
- Greg
Joe Julian
2013-07-15 21:36:06 UTC
Permalink
Ben may be on to something. If you're using NetworkManager, there may be
a chicken and egg problem due to the lack of a hardware link being
established (no switch). What if you mount from localhost?
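
I.e., just as a sketch, an fstab entry like this instead of pointing at the peer's address:

localhost:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0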
Post by Ben Turner
# diff -pruN /etc/rc.d/init.d/netfs /tmp/initrd/netfs
--- /etc/rc.d/init.d/netfs 2013-04-26 14:32:28.759283055 -0400
+++ /tmp/initrd/netfs 2013-04-26 14:31:38.320059175 -0400
@@ -32,8 +32,6 @@ NETDEVMTAB=$(LC_ALL=C awk '$4 ~ /_netdev
# See how we were called.
case "$1" in
start)
- echo "Sleeping 30 seconds for NW init workaround -benT"
- sleep 30
[ ! -f /var/lock/subsys/network ] && ! nm-online -x >/dev/null 2>&1 && exit 0
[ "$EUID" != "0" ] && exit 4
[ -n "$NFSFSTAB" ] &&
I just used the sleep for testing, the preferred way of dealing with this is probably using the LINKDELAY option in your /etc/sysconfig/network-scripts/ifcfg-* script. This variable will cause the network scripts to delay $LINKDELAY number of seconds. Can you try testing with either of those options to see if you are able to mount at boot?
-b
----- Original Message -----
Sent: Monday, July 15, 2013 4:44:10 PM
Subject: Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore
This time after rebooting both nodes, neither one shows /firewall-scripts
mounted after a login. But mount -av by hand is successful on both nodes.
Fw1 and fw2 both behave identically. Here is what fw1 looks like. Fw2 is
identical. This aspect of the problem is screaming timing glitch.
login as: root
Last login: Mon Jul 15 15:19:41 2013 from tinahp100b.infrasupport.local
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
/ : ignored
/boot : already mounted
/boot/efi : already mounted
/gluster-fw1 : already mounted
swap : ignored
extra arguments at end (ignored)
/firewall-scripts : successfully mounted
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
192.168.253.1:/firewall-scripts 7.6G 19M 7.2G 1% /firewall-scripts
- Greg
Ben Turner
2013-07-15 21:45:03 UTC
Permalink
I have seen managed switches take forever to establish connections as well. I was using a Cisco Catalyst at the time, and IIRC I needed to enable PortFast and disable spanning tree.

-b

----- Original Message -----
Sent: Monday, July 15, 2013 5:36:06 PM
Subject: Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore
Ben may be on to something. If you're using NetworkManager, there may be
a chicken and egg problem due to the lack of a hardware link being
established (no switch). What if you mount from localhost?
Post by Ben Turner
Hi Greg. I don't know if this is the thread I replied to before but it
still sound to me like your NICs aren't fully up when the gluster mount is
getting mounted. The _netdev(at least the version in RHEL 6, I haven't
looked at others) doesn't check if the NIC is fully up, it only looks to
see if the NW manager lock file exists. When I saw this happen in my
tests the lockfile existed but the NIC was still initializing and unable
to send/receive traffic to mount the FS. I was able to put a sleep in the
# diff -pruN /etc/rc.d/init.d/netfs /tmp/initrd/netfs
--- /etc/rc.d/init.d/netfs 2013-04-26 14:32:28.759283055 -0400
+++ /tmp/initrd/netfs 2013-04-26 14:31:38.320059175 -0400
@@ -32,8 +32,6 @@ NETDEVMTAB=$(LC_ALL=C awk '$4 ~ /_netdev
# See how we were called.
case "$1" in
start)
- echo "Sleeping 30 seconds for NW init workaround -benT"
- sleep 30
[ ! -f /var/lock/subsys/network ] && ! nm-online -x >/dev/null
2>&1 && exit 0
[ "$EUID" != "0" ] && exit 4
[ -n "$NFSFSTAB" ] &&
I just used the sleep for testing, the preferred way of dealing with this
is probably using the LINKDELAY option in your
/etc/sysconfig/network-scripts/ifcfg-* script. This variable will cause
the network scripts to delay $LINKDELAY number of seconds. Can you try
testing with either of those options to see if you are able to mount at
boot?
-b
----- Original Message -----
Sent: Monday, July 15, 2013 4:44:10 PM
Subject: Re: [Gluster-users] One node goes offline, the other node can't
see the replicated volume anymore
This time after rebooting both nodes, neither one shows /firewall-scripts
mounted after a login. But mount -av by hand is successful on both nodes.
Fw1 and fw2 both behave identically. Here is what fw1 looks like. Fw2 is
identical. This aspect of the problem is screaming timing glitch.
login as: root
Last login: Mon Jul 15 15:19:41 2013 from tinahp100b.infrasupport.local
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
/ : ignored
/boot : already mounted
/boot/efi : already mounted
/gluster-fw1 : already mounted
swap : ignored
extra arguments at end (ignored)
/firewall-scripts : successfully mounted
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
192.168.253.1:/firewall-scripts 7.6G 19M 7.2G 1% /firewall-scripts
- Greg
Greg Scott
2013-07-15 21:59:08 UTC
Permalink
Hmmm - I turn off NetworkManager for my application, but I can easily sleep a while in rc.local before doing mount -av and see what happens. And I will fix up glusterd.system. I'll report back here shortly.
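
Something like this just ahead of the mount -av that's already in rc.local (a sketch; 30 seconds is an arbitrary starting point):

date
echo "Sleeping 30 seconds."
sleep 30
date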
Joe Julian
2013-07-16 15:09:03 UTC
Permalink
Try this: https://gist.github.com/joejulian/6009570 and see if it works any
better. We're looking for "GlusterFS an clustered file-system server"
to appear earlier than the mounting.
Hmmm - I turn off NetworkManager for my application but I can easily sleep a while in rc.local before doing mount -av and see what happens. And I will fix up glusterd.system. I'll report back here shortly.
- Greg
Greg Scott
2013-07-16 15:30:15 UTC
Permalink
Didn’t seem to make a difference. Not mounted right after logging in. Looks like the same behavior. The mount fails, then my rc.local kicks in and says it succeeded, but doesn’t show it mounted later when I do my “after” df -h.

[***@chicago-fw1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
[***@chicago-fw1 ~]#

[***@chicago-fw1 ~]# tail /var/log/messages -c 50000 | more
0.10.71.
Jul 16 10:21:23 chicago-fw1 avahi-daemon[446]: New relevant interface enp5s7.IPv4 for mDNS.
Jul 16 10:21:23 chicago-fw1 avahi-daemon[446]: Registering new address record for 10.10.10.71 on enp5s7.IPv4.
Jul 16 10:21:23 chicago-fw1 kernel: [ 22.284616] r8169 0000:05:04.0 enp5s4: link up
Jul 16 10:21:24 chicago-fw1 kernel: [ 22.996223] r8169 0000:05:07.0 enp5s7: link up
Jul 16 10:21:24 chicago-fw1 kernel: [ 22.996240] IPv6: ADDRCONF(NETDEV_CHANGE): enp5s7: link becomes ready
Jul 16 10:21:24 chicago-fw1 network[464]: Bringing up interface enp5s7: [ OK ]
Jul 16 10:21:25 chicago-fw1 systemd[1]: Started LSB: Bring up/down networking.
Jul 16 10:21:25 chicago-fw1 systemd[1]: Starting Network.
Jul 16 10:21:25 chicago-fw1 systemd[1]: Reached target Network.
Jul 16 10:21:25 chicago-fw1 systemd[1]: Started Login and scanning of iSCSI devices.
Jul 16 10:21:25 chicago-fw1 systemd[1]: Starting Vsftpd ftp daemon...
Jul 16 10:21:25 chicago-fw1 systemd[1]: Starting RPC bind service...
Jul 16 10:21:25 chicago-fw1 systemd[1]: Starting OpenSSH server daemon...
Jul 16 10:21:25 chicago-fw1 systemd[1]: Starting /etc/rc.d/rc.local Compatibility...
Jul 16 10:21:25 chicago-fw1 systemd[1]: Started RPC bind service.
Jul 16 10:21:25 chicago-fw1 systemd[1]: Starting GlusterFS an clustered file-system server...
Jul 16 10:21:25 chicago-fw1 systemd[1]: Started Vsftpd ftp daemon.
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: Tue Jul 16 10:21:25 CDT 2013
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: Sleeping 30 seconds.
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: Tue Jul 16 10:21:25 CDT 2013
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: Making sure the Gluster stuff is mounted
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: Mounted before mount -av
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: Filesystem Size Used Avail Use% Mounted on
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: /dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: devtmpfs 990M 0 990M 0% /dev
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /dev/shm
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: tmpfs 996M 2.1M 994M 1% /run
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /sys/fs/cgroup
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /tmp
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: /dev/sda2 477M 87M 365M 20% /boot
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: /dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
Jul 16 10:21:25 chicago-fw1 systemd[1]: Started OpenSSH server daemon.
Jul 16 10:21:25 chicago-fw1 rc.local[1005]: extra arguments at end (ignored)
Jul 16 10:21:25 chicago-fw1 dbus-daemon[465]: dbus[465]: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)
Jul 16 10:21:25 chicago-fw1 dbus[465]: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)
Jul 16 10:21:25 chicago-fw1 kernel: [ 23.918403] fuse init (API version 7.21)
Jul 16 10:21:25 chicago-fw1 systemd[1]: Mounted /firewall-scripts.
Jul 16 10:21:25 chicago-fw1 systemd[1]: Starting Remote File Systems.
Jul 16 10:21:25 chicago-fw1 systemd[1]: Reached target Remote File Systems.
Jul 16 10:21:25 chicago-fw1 systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Jul 16 10:21:25 chicago-fw1 systemd[1]: Mounting FUSE Control File System...
Jul 16 10:21:25 chicago-fw1 systemd[1]: Mounted FUSE Control File System.
Jul 16 10:21:28 chicago-fw1 systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
Jul 16 10:21:28 chicago-fw1 systemd[1]: Starting Permit User Sessions...
Jul 16 10:21:28 chicago-fw1 systemd[1]: Started Permit User Sessions.
Jul 16 10:21:28 chicago-fw1 systemd[1]: Starting Command Scheduler...
Jul 16 10:21:28 chicago-fw1 systemd[1]: Started Command Scheduler.
Jul 16 10:21:28 chicago-fw1 systemd[1]: Starting Job spooling tools...
Jul 16 10:21:28 chicago-fw1 systemd[1]: Started Job spooling tools.
Jul 16 10:21:28 chicago-fw1 avahi-daemon[446]: Registering new address record for fe80::230:18ff:fea2:a340 on enp5s7.*.
Jul 16 10:21:28 chicago-fw1 dbus[465]: [system] Successfully activated service 'org.fedoraproject.Setroubleshootd'
Jul 16 10:21:28 chicago-fw1 dbus-daemon[465]: dbus[465]: [system] Successfully activated service 'org.fedoraproject.Setroubleshootd'
Jul 16 10:21:31 chicago-fw1 audispd: queue is full - dropping event
Jul 16 10:21:31 chicago-fw1 audispd: queue is full - dropping event
Jul 16 10:21:31 chicago-fw1 audispd: queue is full - dropping event
.
.
.
Jul 16 10:21:33 chicago-fw1 audispd: queue is full - dropping event
Jul 16 10:21:33 chicago-fw1 audispd: queue is full - dropping event
Jul 16 10:21:33 chicago-fw1 audispd: queue is full - dropping event
Jul 16 10:21:34 chicago-fw1 systemd[1]: Started GlusterFS an clustered file-system server.
Jul 16 10:21:34 chicago-fw1 systemd[1]: Starting Network is Online.
Jul 16 10:21:34 chicago-fw1 systemd[1]: Reached target Network is Online.
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: Mount failed. Please check the log file for more details.
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: / : ignored
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: /boot : already mounted
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: /boot/efi : already mounted
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: /gluster-fw1 : already mounted
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: swap : ignored
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: /firewall-scripts : successfully mounted
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: Mounted after mount -av
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: Filesystem Size Used Avail Use% Mounted on
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: /dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: devtmpfs 990M 0 990M 0% /dev
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /dev/shm
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: tmpfs 996M 880K 996M 1% /run
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /sys/fs/cgroup
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /tmp
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: /dev/sda2 477M 87M 365M 20% /boot
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: /dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
Jul 16 10:21:38 chicago-fw1 rc.local[1005]: Starting up firewall common items
Jul 16 10:21:38 chicago-fw1 systemd[1]: Started /etc/rc.d/rc.local Compatibility.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Starting Terminate Plymouth Boot Screen...
Jul 16 10:21:38 chicago-fw1 systemd[1]: Starting Wait for Plymouth Boot Screen to Quit...
Jul 16 10:21:38 chicago-fw1 systemd[1]: Started Terminate Plymouth Boot Screen.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Started Wait for Plymouth Boot Screen to Quit.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Starting Getty on tty1...
Jul 16 10:21:38 chicago-fw1 systemd[1]: Started Getty on tty1.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Starting Login Prompts.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Reached target Login Prompts.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Reached target Multi-User System.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Starting Update UTMP about System Runlevel Changes...
Jul 16 10:21:38 chicago-fw1 systemd[1]: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Started Update UTMP about System Runlevel Changes.
Jul 16 10:21:38 chicago-fw1 systemd[1]: Startup finished in 1.474s (kernel) + 2.210s (initrd) + 33.180s (userspace) = 36.866s.


[***@chicago-fw1 ~]# more /usr/lib/systemd/system/glusterd.service
[Unit]
Description=GlusterFS an clustered file-system server
After=network.target rpcbind.service
Before=network-online.target

[Service]
Type=forking
PIDFile=/run/glusterd.pid
LimitNOFILE=65536
ExecStart=/usr/sbin/glusterd -p /run/glusterd.pid
KillMode=process

[Install]
WantedBy=multi-user.target
[***@chicago-fw1 ~]#


- Greg

Joe Julian
2013-07-16 16:16:05 UTC
Permalink
Get rid of every other mount attempt. No custom systemd script, no
rc.local (I know you start your own app from there, but let's get one
thing working first) and make sure the fstab entry still has the _netdev
option.

Even the guy that wrote systemd (Lennart Poettering) says that we're
correct.

Assuming we are, and you still don't get a mounted filesystem, let's
take another look at the client, brick, and glusterd logs using this
service definition.
Greg Scott
2013-07-16 16:52:04 UTC
Permalink
➢ Get rid of every other mount attempt. No custom systemd script, no rc.local (I know you start
➢ your own app from there, but let's get one thing working first) and make sure the fstab entry
➢ still has the _netdev option.

OK, done on both nodes. Fw1 pasted in below. Not mounted after coming back up.

[***@chicago-fw1 ~]# cd /etc/rc.d
[***@chicago-fw1 rc.d]# mv rc.local greg-rc.local
[***@chicago-fw1 rc.d]# more /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sat Jul 6 04:26:01 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/fedora-root / ext4 defaults 1 1
UUID=818c4142-e389-4f28-a28e-6e26df3caa32 /boot ext4 defaults 1 2
UUID=C57B-BCF9 /boot/efi vfat umask=0077,shortname=winnt 0 0
/dev/mapper/fedora-gluster--fw1 /gluster-fw1 xfs defaults 1 2
/dev/mapper/fedora-swap swap swap defaults 0 0
# Added gluster stuff Greg Scott
localhost:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

[***@chicago-fw1 rc.d]# reboot
login as: root
***@10.10.10.71's password:
Last login: Tue Jul 16 10:21:33 2013 from tinahp100b.infrasupport.local
[***@chicago-fw1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
[***@chicago-fw1 ~]#

-Greg
Greg Scott
2013-07-16 16:55:57 UTC
Permalink
Holy moley – but it **IS** mounted on fw2. Go figure. Welcome to today’s Twilight Zone episode.

[***@chicago-fw2 systemd]# cd /etc/rc.d
[***@chicago-fw2 rc.d]# mv rc.local greg-rc.local
[***@chicago-fw2 rc.d]# more /etc/fstab

#
# /etc/fstab
# Created by anaconda on Sat Jul 6 05:08:55 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/fedora-root / ext4 defaults 1 1
UUID=f0cceb6a-61c4-409b-b882-5d6779a52505 /boot ext4 defaults 1 2
UUID=665D-DF0B /boot/efi vfat umask=0077,shortname=winnt 0 0
/dev/mapper/fedora-gluster--fw2 /gluster-fw2 ext4 defaults 1 2
/dev/mapper/fedora-swap swap swap defaults 0 0
# Added gluster stuff Greg Scott
localhost:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

[***@chicago-fw2 rc.d]# reboot
login as: root
***@10.10.10.72's password:
Last login: Tue Jul 16 10:21:56 2013 from tinahp100b.infrasupport.local
[***@chicago-fw2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 4.2G 8.4G 34% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 90M 362M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw2 7.6G 19M 7.2G 1% /gluster-fw2
localhost:/firewall-scripts 7.6G 19M 7.2G 1% /firewall-scripts
[***@chicago-fw2 ~]#



- Greg

Greg Scott
2013-07-16 16:58:18 UTC
Permalink
BTW, I know I posted fw1 and fw2 results in different emails, but I rebooted both at the same time.


- Greg

Greg Scott
2013-07-17 10:59:53 UTC
Permalink
I just rebooted both fw1 and fw2 again with no custom systemd script, no rc.local, all virgin. This time my /firewall-scripts filesystem is mounted on fw1 and not mounted on fw2. So that's the exact opposite behavior from the same reboot test yesterday.

I’m going to run out of time very soon to tinker with this – the system it’s replacing is 400 miles away and degrading fast.


- Greg


Greg Scott
2013-07-17 11:31:46 UTC
Permalink
This combination seems to mount my /firewall-scripts filesystem on both nodes. My application stuff is still commented out for right now. I’ll have to change that very soon so I can finish debugging my own stuff.

I had to put the sleep 30 back into my glustermount.sh (which used to be rc.local). I know sleeping 30 seconds is ugly, and I would **much** rather do it deterministically by making sure it runs in the proper order. But I'm not so sure that "After" line in the .service file really means it. Or maybe "After" means start after those items are started, but not necessarily finished. Not really sure. I wish I had a better handle on that.

And then my other systemd issue with all this is, systemd doesn't seem to record all the output of the scripts it runs. I have a bunch of debug stuff in my glustermount.sh script, but when I do systemctl status glustermount.service -n 50, it is always missing the last 10 or so lines of output from my script. I've shown that in a few other posts in this thread. The debug stuff shows a df -h, then the result of mount -av, then another df -h. systemd consistently truncates the output from that second df -h. So when I use systemd to run that script, I never see before and after results. Frustrating.

[***@chicago-fw1 ~]# more /usr/lib/systemd/system/glustermount.service
# This unit is Greg's attempt to mount my Gluster filesystem.
# Ripped off from rc-local.service
# Must run after glusterd.service

[Unit]
Description=Set up Gluster mounts
ConditionFileIsExecutable=/etc/rc.d/glustermount.sh
After=network.target glusterd.service

[Service]
Type=forking
ExecStart=/etc/rc.d/glustermount.sh
TimeoutSec=0
RemainAfterExit=no
SysVStartPriority=99

[Install]
WantedBy=multi-user.target
[***@chicago-fw1 ~]#
[***@chicago-fw1 ~]#
[***@chicago-fw1 ~]# more /etc/rc.d/glustermount.sh
#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
#
# Note removed by default starting in Fedora 16.

touch /var/lock/subsys/local

#***********************************
# Local stuff below

date
echo "Sleeping 30 seconds."
sleep 30
date
echo "Making sure the Gluster stuff is mounted"
echo "Mounted before mount -av"
df -h
mount -av
echo "Mounted after mount -av"
df -h
# The fstab mounts happen early in startup, then Gluster starts up later.
# By now, Gluster should be up and running and the mounts should work.
# That _netdev option is supposed to account for the delay but doesn't seem
# to work right.

echo "Starting up firewall common items"
## /firewall-scripts/etc/rc.d/common-rc.local
[***@chicago-fw1 ~]#
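
A sketch of a more deterministic variant of that helper, for comparison: a oneshot unit that is ordered after (and requires) glusterd.service, plus a script that retries the mount for a bounded time instead of sleeping a fixed 30 seconds. This is only a sketch built from the names used in this thread (the /firewall-scripts mount point and the /etc/rc.d/glustermount.sh path); the retry bound is arbitrary and nothing here is tested against this exact setup.

# /usr/lib/systemd/system/glustermount.service (sketch)
[Unit]
Description=Set up Gluster mounts
# Requires= pulls glusterd in; After= only orders this unit behind it.
Requires=glusterd.service
After=network.target glusterd.service

[Service]
# oneshot + RemainAfterExit=yes means "the script ran once and succeeded".
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/rc.d/glustermount.sh

[Install]
WantedBy=multi-user.target

#!/bin/sh
# /etc/rc.d/glustermount.sh (sketch): keep retrying the mount until it sticks,
# instead of a fixed sleep 30.
MOUNTPOINT=/firewall-scripts
MAX_TRIES=30                 # arbitrary: about 60 seconds at 2-second intervals

try=0
until mountpoint -q "$MOUNTPOINT"; do
    try=$((try + 1))
    if [ "$try" -gt "$MAX_TRIES" ]; then
        echo "Giving up on $MOUNTPOINT after $MAX_TRIES attempts" >&2
        exit 1
    fi
    mount "$MOUNTPOINT" 2>/dev/null   # uses the existing fstab entry
    sleep 2
done
echo "$MOUNTPOINT is mounted"

If glusterd reports "started" before the brick processes have registered their ports (which is what the portmap errors later in this thread suggest), a loop like this papers over the gap without depending on exact timing.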
Greg Scott
2013-07-18 05:00:50 UTC
Permalink
Still not out of the woods. I can get everything mounted on both nodes with my systemd service hack. But now I’m back to the original problem. Well, sort of. Here is the scenario.

My Gluster volume named /firewall-scripts is mounted on both fw1 and fw2. Trying to simulate a cable issue, on fw1, I do:

ifdown enp5s4

And now all access to my /firewall-scripts volume on fw1 goes away. Fw2 can see it again after the mystical 42 seconds pass. When I do

ifup enp5s4

I still can't see my /firewall-scripts volume on fw1 and it is no longer mounted. Not quite one minute later, my volume is mounted again and life goes on.

If that 42 second timeout is settable, how do I set it for a better number for my application? The Gluster/Heartbeat network in this case will just be a cable connecting the two nodes.

Thanks


- Greg
Ben Turner
2013-07-18 14:33:29 UTC
Permalink
You can set the timeout with:

$ gluster volume set <volname> network.ping-timeout <N>

I don't usually set it to anything under 20.

-b
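
Applied to the volume in this thread, that would look something like the lines below; the 10-second value is only an illustration. Once an option has been set explicitly, it should also show up in gluster volume info under an "Options Reconfigured" section.

gluster volume set firewall-scripts network.ping-timeout 10
gluster volume info firewall-scripts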

Greg Scott
2013-07-18 19:37:26 UTC
Permalink
Thanks Ben

- Greg

Greg Scott
2013-07-18 21:20:28 UTC
Permalink
Post by Ben Turner
$ gluster volume set <volname> network.ping-timeout <N>
I don't usually set it to anything under 20.
I was just getting ready to put this in, but a bunch more questions are filling my head. The biggie is, how do I look up the current setting? Gluster volume info doesn't show me that number. Is there something else that can show all the detailed settings?

I was thinking of setting it down to as little as 5 seconds, and even 5 seconds might be too long. In my specific use case, my failover script polls the active partner every 10 seconds. By default, if he doesn't respond in 2 intervals (20 seconds), I initiate my failover stuff. When I start a failover, I really, really, really need that /firewall-scripts directory to be usable, so in this specific use case I'm not sure it makes sense to wait 20 seconds. But before I mess with it, I want to see where it's set right now so I have a baseline.

Thanks

- Greg
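
A few places the current value can usually be checked, assuming this 3.4 beta behaves like later releases (only options that have been explicitly set will appear; the 42-second default itself is built in):

# Explicitly set options show up under "Options Reconfigured":
gluster volume info firewall-scripts
# glusterd also records the set options in its state file on each server:
cat /var/lib/glusterd/vols/firewall-scripts/info
# If this build supports it, list tunables together with their default values:
gluster volume set help | grep -A 2 network.ping-timeout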


Marcus Bointon
2013-07-15 22:05:21 UTC
Permalink
Post by Greg Scott
# The fstab mounts happen early in startup, then Gluster starts up later.
# By now, Gluster should be up and running and the mounts should work.
# That _netdev option is supposed to account for the delay but doesn't seem
# to work right.
It's interesting to see that script - that's what happens to me with 3.3.0. If I set gluster mounts to mount from fstab with _netdev, it hangs the boot completely and I have to go into single user mode (and edit it out of fstab) to recover, though gluster logs nothing at all. Autofs fails too (though I think that's autofs not understanding mounting an NFS volume from localhost), yet it all works on a manual mount.

Sad to see you're having trouble with 3.4. I hope you can make it work!

Marcus
Greg Scott
2013-07-15 22:12:02 UTC
Permalink
And for what it's worth, I just now looked and noticed rc.local does not really run last in the startup sequence anymore. According to below, it only depends on the network being started. So I could easily be trying my mounts before gluster ever gets fired up.

[***@chicago-fw1 system]# pwd
/usr/lib/systemd/system
[***@chicago-fw1 system]# more rc-local.service
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.

# This unit gets pulled automatically into multi-user.target by
# systemd-rc-local-generator if /etc/rc.d/rc.local is executable.
[Unit]
Description=/etc/rc.d/rc.local Compatibility
ConditionFileIsExecutable=/etc/rc.d/rc.local
After=network.target

[Service]
Type=forking
ExecStart=/etc/rc.d/rc.local start
TimeoutSec=0
RemainAfterExit=yes
SysVStartPriority=99
[***@chicago-fw1 system]#

- Greg


Greg Scott
2013-07-15 22:23:37 UTC
Permalink
I think we're making progress. I put in a sleep 30 in my rc.local, rebooted, and my filesystem is now mounted after my first logon.

Still some stuff I don't understand in /var/log/messages, but my before and after mounts look much better. And notice how messages from all kinds of things get mixed in together. So systemd must fire up a bunch of concurrent threads to do its thing. And that's why F19 boots so fast. But the tradeoff is you can't count on things happening in sequence.

I wonder if I can set up one of those service doo-dad files, where I want glusterd started first and then have it run a script to mount my stuff? That would be a more deterministic way to do it versus sleeping 30 seconds in rc.local. I have to go out for a couple of hours. I'll see what I can put together and report results here.

Jul 15 17:13:18 chicago-fw1 audispd: queue is full - dropping event
Jul 15 17:13:18 chicago-fw1 audispd: queue is full - dropping event
Jul 15 17:13:18 chicago-fw1 audispd: queue is full - dropping event
Jul 15 17:13:20 chicago-fw1 systemd[1]: Started GlusterFS an clustered file-system server.
Jul 15 17:13:22 chicago-fw1 mount[1001]: Mount failed. Please check the log file for more details.
Jul 15 17:13:22 chicago-fw1 systemd[1]: firewall\x2dscripts.mount mount process exited, code=exited status=1
Jul 15 17:13:22 chicago-fw1 systemd[1]: Unit firewall\x2dscripts.mount entered failed state.
.
.
. a bazillion meaningless selinux warnings (because selinux=permissive here)
.
.
Jul 15 17:13:40 chicago-fw1 setroubleshoot: SELinux is preventing /usr/sbin/glusterfsd from name_bind access on the tcp_socket . For complete SELinux messages. run sealert -l 221b72d0-d5d8-4a70-bedd-697a6b9e0f03
Jul 15 17:13:40 chicago-fw1 setroubleshoot: SELinux is preventing /usr/sbin/glusterfsd from name_bind access on the tcp_socket . For complete SELinux messages. run sealert -l 22b9b899-3fe2-47fc-8c5d-7bd5ed0e1f17
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: Mon Jul 15 17:13:40 CDT 2013
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: Making sure the Gluster stuff is mounted
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: Mounted before mount -av
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: Filesystem Size Used Avail Use% Mounted on
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: devtmpfs 990M 0 990M 0% /dev
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /dev/shm
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: tmpfs 996M 884K 996M 1% /run
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /sys/fs/cgroup
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /tmp
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /dev/sda2 477M 87M 365M 20% /boot
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: extra arguments at end (ignored)
Jul 15 17:13:40 chicago-fw1 setroubleshoot: SELinux is preventing /usr/sbin/glusterfsd from name_bind access on the tcp_socket . For complete SELinux messages. run sealert -l 225efbe9-0ea3-4f5b-8791-c325d2f0eed6
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: / : ignored
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /boot : already mounted
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /boot/efi : already mounted
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /gluster-fw1 : already mounted
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: swap : ignored
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /firewall-scripts : successfully mounted
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: Mounted after mount -av
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: Filesystem Size Used Avail Use% Mounted on
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: devtmpfs 990M 0 990M 0% /dev
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /dev/shm
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: tmpfs 996M 884K 996M 1% /run
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /sys/fs/cgroup
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: tmpfs 996M 0 996M 0% /tmp
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /dev/sda2 477M 87M 365M 20% /boot
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: /dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: 192.168.253.1:/firewall-scripts 7.6G 19M 7.2G 1% /firewall-scripts
Jul 15 17:13:40 chicago-fw1 rc.local[1005]: Starting up firewall common items
Jul 15 17:13:40 chicago-fw1 systemd[1]: Started /etc/rc.d/rc.local Compatibility.
Jul 15 17:13:40 chicago-fw1 systemd[1]: Starting Terminate Plymouth Boot Screen...


Greg Scott
Infrasupport Corporation
***@Infrasupport.com

Direct 1-651-260-1051


Greg Scott
2013-07-16 00:58:20 UTC
Permalink
Back to the Twilight Zone again.

I removed my rc.local this time and did a reboot, so the fstab mounts should have taken care of it. But they didn't. The fstab line looks like this now on both nodes:

localhost:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

After logon, my firewall-scripts filesystem is not mounted. A mount -av by hand mounts it up.

Here is the extract from /var/log/messages. I noticed a couple of mentions of mounts.
.
.
.
Jul 15 19:39:56 chicago-fw1 network[457]: Bringing up interface enp5s7: [ OK ]
Jul 15 19:39:56 chicago-fw1 systemd[1]: Started LSB: Bring up/down networking.
Jul 15 19:39:56 chicago-fw1 systemd[1]: Starting Network.
Jul 15 19:39:56 chicago-fw1 systemd[1]: Reached target Network.
Jul 15 19:39:56 chicago-fw1 systemd[1]: Started Login and scanning of iSCSI devices.
Jul 15 19:39:56 chicago-fw1 systemd[1]: Mounting /firewall-scripts...
Jul 15 19:39:56 chicago-fw1 systemd[1]: Starting Vsftpd ftp daemon...
Jul 15 19:39:56 chicago-fw1 systemd[1]: Starting RPC bind service...
Jul 15 19:39:56 chicago-fw1 systemd[1]: Starting OpenSSH server daemon...
Jul 15 19:39:56 chicago-fw1 systemd[1]: Started RPC bind service.
Jul 15 19:39:56 chicago-fw1 systemd[1]: Started Vsftpd ftp daemon.
Jul 15 19:39:56 chicago-fw1 systemd[1]: Starting GlusterFS an clustered file-system server...
Jul 15 19:39:56 chicago-fw1 systemd[1]: Started OpenSSH server daemon.
Jul 15 19:39:56 chicago-fw1 dbus-daemon[458]: dbus[458]: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)
Jul 15 19:39:56 chicago-fw1 dbus[458]: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)
Jul 15 19:39:56 chicago-fw1 kernel: [ 24.267903] fuse init (API version 7.21)
Jul 15 19:39:56 chicago-fw1 systemd[1]: Mounted /firewall-scripts.
Jul 15 19:39:56 chicago-fw1 systemd[1]: Starting Remote File Systems.
Jul 15 19:39:56 chicago-fw1 systemd[1]: Reached target Remote File Systems.
Jul 15 19:39:56 chicago-fw1 systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Jul 15 19:39:56 chicago-fw1 systemd[1]: Mounting FUSE Control File System...
Jul 15 19:39:59 chicago-fw1 systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
Jul 15 19:39:59 chicago-fw1 systemd[1]: Mounted FUSE Control File System.
Jul 15 19:39:59 chicago-fw1 systemd[1]: Starting Permit User Sessions...
Jul 15 19:39:59 chicago-fw1 systemd[1]: Started Permit User Sessions.
Jul 15 19:39:59 chicago-fw1 systemd[1]: Starting Command Scheduler...
Jul 15 19:39:59 chicago-fw1 systemd[1]: Started Command Scheduler.
Jul 15 19:39:59 chicago-fw1 systemd[1]: Starting Job spooling tools...
Jul 15 19:39:59 chicago-fw1 systemd[1]: Started Job spooling tools.
Jul 15 19:39:59 chicago-fw1 systemd[1]: Starting Terminate Plymouth Boot Screen...
Jul 15 19:39:59 chicago-fw1 systemd[1]: Starting Wait for Plymouth Boot Screen to Quit...
Jul 15 19:39:59 chicago-fw1 systemd[1]: Started Terminate Plymouth Boot Screen.
Jul 15 19:39:59 chicago-fw1 avahi-daemon[445]: Registering new address record for fe80::230:18ff:fea2:a340 on enp5s7.*.
Jul 15 19:39:59 chicago-fw1 dbus[458]: [system] Successfully activated service 'org.fedoraproject.Setroubleshootd'
Jul 15 19:39:59 chicago-fw1 dbus-daemon[458]: dbus[458]: [system] Successfully activated service 'org.fedoraproject.Setroubleshootd'
Jul 15 19:40:02 chicago-fw1 audispd: queue is full - dropping event
Jul 15 19:40:02 chicago-fw1 audispd: queue is full - dropping event
.
.
. zillions more "queue is full" messages
.
.
Jul 15 19:40:04 chicago-fw1 audispd: queue is full - dropping event
Jul 15 19:40:04 chicago-fw1 audispd: queue is full - dropping event
Jul 15 19:40:05 chicago-fw1 systemd[1]: Started GlusterFS an clustered file-system server.
Jul 15 19:40:05 chicago-fw1 systemd[1]: Starting Multi-User System.
Jul 15 19:40:05 chicago-fw1 systemd[1]: Reached target Multi-User System.
Jul 15 19:40:05 chicago-fw1 systemd[1]: Starting Update UTMP about System Runlevel Changes...
Jul 15 19:40:05 chicago-fw1 systemd[1]: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
Jul 15 19:40:05 chicago-fw1 systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Jul 15 19:40:05 chicago-fw1 systemd[1]: Started Update UTMP about System Runlevel Changes.
Jul 15 19:40:05 chicago-fw1 systemd[1]: Startup finished in 1.482s (kernel) + 2.210s (initrd) + 29.710s (userspace) = 33.403s.
Jul 15 19:40:06 chicago-fw1 mount[1000]: Mount failed. Please check the log file for more details.
Jul 15 19:40:06 chicago-fw1 systemd[1]: firewall\x2dscripts.mount mount process exited, code=exited status=1
Jul 15 19:40:06 chicago-fw1 systemd[1]: Unit firewall\x2dscripts.mount entered failed state.
Jul 15 19:40:06 chicago-fw1 rpc.statd[1184]: Version 1.2.7 starting
Jul 15 19:40:06 chicago-fw1 sm-notify[1185]: Version 1.2.7 starting
Jul 15 19:40:06 chicago-fw1 setroubleshoot: Plugin Exception catchall_labels
Jul 15 19:40:06 chicago-fw1 setroubleshoot: SELinux is preventing /usr/sbin/glusterfsd from mounton access on the directory /firewall-scripts. For complete SELinux messages. run sealert -l 7fb3c8ad-94f4-4292-b4ee-2495b452ef4b
.
.
.
Greg Scott
2013-07-16 01:16:21 UTC
Permalink
Here is a tail from /var/log/glusterfs/firewall-scripts.log. The entries at 00:39:56 GMT must correspond to /var/log/messages at 19:39:56 USA Central time. There are no firewall rules and the default policy is ACCEPT. With no rc.local file, my application is out of the picture right now, so nobody is sleeping or taking themselves offline or anything like that. Just a straight up boot using fstab to mount the filesystems.
.
.
.
[2013-07-16 00:36:40.495183] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-0
[2013-07-16 00:39:12.584585] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2013-07-16 00:39:12.584853] W [socket.c:1962:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (No data available), peer (127.0.0.1:24007)
[2013-07-16 00:39:12.714225] W [glusterfsd.c:970:cleanup_and_exit] (-->/usr/lib64/libc.so.6(clone+0x6d) [0x7fd5c27f913d] (-->/usr/lib64/libpthread.so.0(+0x33c1607c53) [0x7fd5c2e8fc53] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fd5c3b7de35]))) 0-: received signum (15), shutting down
[2013-07-16 00:39:12.714289] I [fuse-bridge.c:5212:fini] 0-fuse: Unmounting '/firewall-scripts'.
[2013-07-16 00:39:12.715170] I [fuse-bridge.c:4583:fuse_thread_proc] 0-fuse: unmounting /firewall-scripts
[2013-07-16 00:39:56.667185] I [glusterfsd.c:1878:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.0beta4 (/usr/sbin/glusterfs --volfile-id=/firewall-scripts --volfile-server=localhost /firewall-scripts)
[2013-07-16 00:39:56.726785] I [socket.c:3480:socket_init] 0-glusterfs: SSL support is NOT enabled
[2013-07-16 00:39:56.726993] I [socket.c:3495:socket_init] 0-glusterfs: using system polling thread
[2013-07-16 00:40:05.794794] I [socket.c:3480:socket_init] 0-firewall-scripts-client-1: SSL support is NOT enabled
[2013-07-16 00:40:05.794927] I [socket.c:3495:socket_init] 0-firewall-scripts-client-1: using system polling thread
[2013-07-16 00:40:05.801351] I [socket.c:3480:socket_init] 0-firewall-scripts-client-0: SSL support is NOT enabled
[2013-07-16 00:40:05.801486] I [socket.c:3495:socket_init] 0-firewall-scripts-client-0: using system polling thread
[2013-07-16 00:40:05.801611] I [client.c:2154:notify] 0-firewall-scripts-client-0: parent translators are ready, attempting connect on transport
[2013-07-16 00:40:05.817724] I [client.c:2154:notify] 0-firewall-scripts-client-1: parent translators are ready, attempting connect on transport
Given volfile:
+------------------------------------------------------------------------------+
1: volume firewall-scripts-client-0
2: type protocol/client
3: option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
4: option username de6eacd1-31bc-4bdb-a049-776cd840059e
5: option transport-type tcp
6: option remote-subvolume /gluster-fw1
7: option remote-host 192.168.253.1
8: end-volume
9:
10: volume firewall-scripts-client-1
11: type protocol/client
12: option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
13: option username de6eacd1-31bc-4bdb-a049-776cd840059e
14: option transport-type tcp
15: option remote-subvolume /gluster-fw2
16: option remote-host 192.168.253.2
17: end-volume
18:
19: volume firewall-scripts-replicate-0
20: type cluster/replicate
21: subvolumes firewall-scripts-client-0 firewall-scripts-client-1
22: end-volume
23:
24: volume firewall-scripts-dht
25: type cluster/distribute
26: subvolumes firewall-scripts-replicate-0
27: end-volume
28:
29: volume firewall-scripts-write-behind
30: type performance/write-behind
31: subvolumes firewall-scripts-dht
32: end-volume
33:
34: volume firewall-scripts-read-ahead
35: type performance/read-ahead
36: subvolumes firewall-scripts-write-behind
37: end-volume
38:
39: volume firewall-scripts-io-cache
40: type performance/io-cache
41: subvolumes firewall-scripts-read-ahead
42: end-volume
43:
44: volume firewall-scripts-quick-read
45: type performance/quick-read
46: subvolumes firewall-scripts-io-cache
47: end-volume
48:
49: volume firewall-scripts-open-behind
50: type performance/open-behind
51: subvolumes firewall-scripts-quick-read
52: end-volume
53:
54: volume firewall-scripts-md-cache
55: type performance/md-cache
56: subvolumes firewall-scripts-open-behind
57: end-volume
58:
59: volume firewall-scripts
60: type debug/io-stats
61: option count-fop-hits off
62: option latency-measurement off
63: subvolumes firewall-scripts-md-cache
64: end-volume

+------------------------------------------------------------------------------+
[2013-07-16 00:40:05.975356] E [client-handshake.c:1741:client_query_portmap_cbk] 0-firewall-scripts-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2013-07-16 00:40:05.975588] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-16 00:40:05.975708] I [client.c:2097:client_rpc_notify] 0-firewall-scripts-client-0: disconnected
[2013-07-16 00:40:06.027821] E [client-handshake.c:1741:client_query_portmap_cbk] 0-firewall-scripts-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2013-07-16 00:40:06.028010] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-1: readv failed (No data available)
[2013-07-16 00:40:06.028103] I [client.c:2097:client_rpc_notify] 0-firewall-scripts-client-1: disconnected
[2013-07-16 00:40:06.028148] E [afr-common.c:3735:afr_notify] 0-firewall-scripts-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-07-16 00:40:06.048172] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
[2013-07-16 00:40:06.049068] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
[2013-07-16 00:40:06.051158] W [fuse-bridge.c:665:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (No such file or directory)
[2013-07-16 00:40:06.077420] I [fuse-bridge.c:4583:fuse_thread_proc] 0-fuse: unmounting /firewall-scripts
[2013-07-16 00:40:06.078427] W [glusterfsd.c:970:cleanup_and_exit] (-->/usr/lib64/libc.so.6(clone+0x6d) [0x7f1f0078d13d] (-->/usr/lib64/libpthread.so.0(+0x33c1607c53) [0x7f1f00e23c53] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7f1f01b11e35]))) 0-: received signum (15), shutting down
[2013-07-16 00:40:06.078501] I [fuse-bridge.c:5212:fini] 0-fuse: Unmounting '/firewall-scripts'.
[2013-07-16 00:53:39.844556] I [glusterfsd.c:1878:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.0beta4 (/usr/sbin/glusterfs --volfile-id=/firewall-scripts --volfile-server=localhost /firewall-scripts)
[2013-07-16 00:53:39.858957] I [socket.c:3480:socket_init] 0-glusterfs: SSL support is NOT enabled
[2013-07-16 00:53:39.859117] I [socket.c:3495:socket_init] 0-glusterfs: using system polling thread
[2013-07-16 00:53:39.907716] I [socket.c:3480:socket_init] 0-firewall-scripts-client-1: SSL support is NOT enabled
[2013-07-16 00:53:39.907881] I [socket.c:3495:socket_init] 0-firewall-scripts-client-1: using system polling thread
[2013-07-16 00:53:39.909563] I [socket.c:3480:socket_init] 0-firewall-scripts-client-0: SSL support is NOT enabled
[2013-07-16 00:53:39.909655] I [socket.c:3495:socket_init] 0-firewall-scripts-client-0: using system polling thread
[2013-07-16 00:53:39.909778] I [client.c:2154:notify] 0-firewall-scripts-client-0: parent translators are ready, attempting connect on transport
[2013-07-16 00:53:39.920933] I [client.c:2154:notify] 0-firewall-scripts-client-1: parent translators are ready, attempting connect on transport
Given volfile:
+------------------------------------------------------------------------------+
1: volume firewall-scripts-client-0
2: type protocol/client
3: option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
4: option username de6eacd1-31bc-4bdb-a049-776cd840059e
5: option transport-type tcp
6: option remote-subvolume /gluster-fw1
7: option remote-host 192.168.253.1
8: end-volume
9:
10: volume firewall-scripts-client-1
11: type protocol/client
12: option password fb3955b7-a6ca-49bb-b886-d4b6609392f8
13: option username de6eacd1-31bc-4bdb-a049-776cd840059e
14: option transport-type tcp
15: option remote-subvolume /gluster-fw2
16: option remote-host 192.168.253.2
17: end-volume
18:
19: volume firewall-scripts-replicate-0
20: type cluster/replicate
21: subvolumes firewall-scripts-client-0 firewall-scripts-client-1
22: end-volume
23:
24: volume firewall-scripts-dht
25: type cluster/distribute
26: subvolumes firewall-scripts-replicate-0
27: end-volume
28:
29: volume firewall-scripts-write-behind
30: type performance/write-behind
31: subvolumes firewall-scripts-dht
32: end-volume
33:
34: volume firewall-scripts-read-ahead
35: type performance/read-ahead
36: subvolumes firewall-scripts-write-behind
37: end-volume
38:
39: volume firewall-scripts-io-cache
40: type performance/io-cache
41: subvolumes firewall-scripts-read-ahead
42: end-volume
43:
44: volume firewall-scripts-quick-read
45: type performance/quick-read
46: subvolumes firewall-scripts-io-cache
47: end-volume
48:
49: volume firewall-scripts-open-behind
50: type performance/open-behind
51: subvolumes firewall-scripts-quick-read
52: end-volume
53:
54: volume firewall-scripts-md-cache
55: type performance/md-cache
56: subvolumes firewall-scripts-open-behind
57: end-volume
58:
59: volume firewall-scripts
60: type debug/io-stats
61: option count-fop-hits off
62: option latency-measurement off
63: subvolumes firewall-scripts-md-cache
64: end-volume

+------------------------------------------------------------------------------+
[2013-07-16 00:53:39.933009] I [rpc-clnt.c:1676:rpc_clnt_reconfig] 0-firewall-scripts-client-0: changing port to 49152 (from 0)
[2013-07-16 00:53:39.933178] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-16 00:53:39.950457] I [rpc-clnt.c:1676:rpc_clnt_reconfig] 0-firewall-scripts-client-1: changing port to 49152 (from 0)
[2013-07-16 00:53:39.950621] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-1: readv failed (No data available)
[2013-07-16 00:53:39.966646] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2013-07-16 00:53:39.966994] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2013-07-16 00:53:39.967417] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-0: Connected to 192.168.253.1:49152, attached to remote volume '/gluster-fw1'.
[2013-07-16 00:53:39.967498] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-16 00:53:39.967800] I [afr-common.c:3698:afr_notify] 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' came back up; going online.
[2013-07-16 00:53:39.967959] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-0: Server lk version = 1
[2013-07-16 00:53:39.968068] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-1: Connected to 192.168.253.2:49152, attached to remote volume '/gluster-fw2'.
[2013-07-16 00:53:39.968114] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-16 00:53:39.982434] I [fuse-bridge.c:4723:fuse_graph_setup] 0-fuse: switched to graph 0
[2013-07-16 00:53:39.982877] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-1: Server lk version = 1
[2013-07-16 00:53:39.983205] I [fuse-bridge.c:3680:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.21
[2013-07-16 00:53:39.984516] I [afr-common.c:2057:afr_set_root_inode_on_first_lookup] 0-firewall-scripts-replicate-0: added root inode
[2013-07-16 00:53:39.985412] I [afr-common.c:2120:afr_discovery_cbk] 0-firewall-scripts-replicate-0: selecting local read_child firewall-scripts-client-0
[***@chicago-fw1 ~]#
Robert Hajime Lanning
2013-07-16 01:42:11 UTC
Permalink
Post by Greg Scott
Here is a tail from /var/log/glusterfs/firewall-scripts.log. The entries at 00:39:56 GMT must correspond to /var/log/messages at 19:39:56 USA Central time. There are no firewall rules and the default policy is ACCEPT. With no rc.local file, my application is out of the picture right now, so nobody is sleeping or taking themselves offline or anything like that. Just a straight up boot using fstab to mount the filesystems.
Run "chkconfig --list netfs" and make sure it is set to "on" for your
default run level.
--
Mr. Flibble
King of the Potato People
Robert Hajime Lanning
2013-07-16 01:45:28 UTC
Permalink
Post by Robert Hajime Lanning
Post by Greg Scott
Here is a tail from /var/log/glusterfs/firewall-scripts.log. The
entries at 00:39:56 GMT must correspond to /var/log/messages at
19:39:56 USA Central time. There are no firewall rules and the
default policy is ACCEPT. With no rc.local file, my application is
out of the picture right now, so nobody is sleeping or taking
themselves offline or anything like that. Just a straight up boot
using fstab to mount the filesystems.
Run "chkconfig --list netfs" and make sure it is set to "on" for your
default run level.
Never mind. I didn't read the log completely.
You are getting a read error with "no data available".
--
Mr. Flibble
King of the Potato People
Greg Scott
2013-07-16 02:29:01 UTC
Permalink
Post by Robert Hajime Lanning
Never mind. I didn't read the log completely.
You are getting a read error with "no data available".
Well, OK. So how do I figure out what its problem might be so I can fix it? I wonder if turning on that debug logging level Joe mentioned a few posts ago might be helpful?

Also, chkconfig is pretty much obsolete in the Fedora world by now. Coming soon to a RHEL near you.

- Greg
Joe Julian
2013-07-16 04:42:03 UTC
Permalink
It does look like a race condition. I have an idea that I want to run
past Kaleb tomorrow morning. I'll get back to you then.
Post by Greg Scott
Post by Robert Hajime Lanning
Never mind. I didn't read the log completely.
You are getting a read error with "no data available".
Well, OK. So how do I figure out what its problem might be so I can fix it? I wonder if turning on that debug logging level Joe mentioned a few posts ago might be helpful?
Also, chkconfig is pretty much obsolete in the Fedora world by now. Coming soon to a RHEL near you.
- Greg
_______________________________________________
Gluster-users mailing list
http://supercolony.gluster.org/mailman/listinfo/gluster-users
Greg Scott
2013-07-16 05:38:33 UTC
Permalink
Hmmm - I mount my firewall-scripts volume like this now in fstab:

localhost:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

But gluster volume info still looks like this on both nodes:

[***@chicago-fw1 system]# gluster volume info

Volume Name: firewall-scripts
Type: Replicate
Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2
[***@chicago-fw1 system]#

Seems to me, my Gluster volume would still have some dependency on networking, even though I mount it with localhost in fstab, right?

In all our trial and error, the only thing that's worked so far is putting in that 30-second delay in my custom rc.local file. So instead of fighting this, why not try my hand at a systemd service file named, say, glustermount.service?

So here is the glustermount.service file I made. It runs a script named /etc/rc.d/glustermount.sh, pasted in below.

[***@chicago-fw1 rc.d]# cd /usr/lib/systemd/system
[***@chicago-fw1 system]# more glustermount.service
# This unit is Greg's attempt to mount my Gluster filesystem.
[Unit]
Description=Set up Gluster mounts
ConditionFileIsExecutable=/etc/rc.d/glustermount.sh
After=network.target glusterd.service

[Service]
Type=oneshot
ExecStart=/etc/rc.d/glustermount.sh
TimeoutSec=0
RemainAfterExit=no
SysVStartPriority=99

[Install]
WantedBy=multi-user.target
[***@chicago-fw1 system]#
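For comparison, a variant of this unit (a sketch only, not tested on this setup) that declares the glusterd dependency explicitly and keeps the oneshot unit "active" after the script exits would look like:

# Sketch only - same script, but with an explicit glusterd dependency
# and RemainAfterExit so systemctl status keeps reporting the unit as active.
[Unit]
Description=Set up Gluster mounts
ConditionFileIsExecutable=/etc/rc.d/glustermount.sh
Requires=glusterd.service
After=network-online.target glusterd.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/etc/rc.d/glustermount.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

RemainAfterExit=yes is what keeps a oneshot unit from showing "inactive (dead)" the moment its script returns.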

Here is glustermount.sh as it exists right now. Really just a renamed rc.local I've been using the whole time, now with some debug echo lines.

[***@chicago-fw1 rc.d]# more glustermount.sh
#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
#
# Note removed by default starting in Fedora 16.

touch /var/lock/subsys/local

#***********************************
# Local stuff below

date
echo "Sleeping 30 seconds."
### sleep 30
date
echo "Making sure the Gluster stuff is mounted"
echo "Mounted before mount -av"
df -h
mount -av
echo "Mounted after mount -av"
df -h
# The fstab mounts happen early in startup, then Gluster starts up later.
# By now, Gluster should be up and running and the mounts should work.
# That _netdev option is supposed to account for the delay but doesn't seem
# to work right.

echo "Starting up firewall common items"
## /firewall-scripts/etc/rc.d/common-rc.local
[***@chicago-fw1 rc.d]#
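For comparison, a version of the same idea that polls for the volume instead of relying on a fixed 30-second sleep might look roughly like this (a sketch; it assumes the gluster CLI is installed and reuses the volume name and mount point from this thread):

#!/bin/sh
# Sketch: wait (up to ~30 seconds) until glusterd can report the volume,
# then mount it via the existing fstab entry.
tries=0
until gluster volume status firewall-scripts >/dev/null 2>&1; do
    tries=$((tries + 1))
    [ "$tries" -ge 30 ] && break
    sleep 1
done
mountpoint -q /firewall-scripts || mount /firewall-scripts
df -h /firewall-scripts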

I should see the df output from before and after the mount when I look at systemctl status. But I don't - the second df listing shows only its header line. And systemctl status shows my /firewall-scripts filesystem successfully mounted, but when I go to look with df -h afterward, it's not mounted. Yet another midnight mystery - take a look.

[***@chicago-fw1 system]# systemctl start glustermount.service
[***@chicago-fw1 system]# systemctl status glustermount.service -n 50
glustermount.service - Set up Gluster mounts
Loaded: loaded (/usr/lib/systemd/system/glustermount.service; enabled)
Active: inactive (dead) since Tue 2013-07-16 00:27:36 CDT; 5s ago
Process: 2464 ExecStart=/etc/rc.d/glustermount.sh (code=exited, status=0/SUCCESS)

Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: Tue Jul 16 00:27:35 CDT 2013
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: Sleeping 30 seconds.
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: Tue Jul 16 00:27:35 CDT 2013
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: Making sure the Gluster stuff is mounted
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: Mounted before mount -av
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: Filesystem Size Used Avail Use% Mounted on
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: /dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: devtmpfs 990M 0 990M 0% /dev
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: tmpfs 996M 0 996M 0% /dev/shm
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: tmpfs 996M 888K 996M 1% /run
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: tmpfs 996M 0 996M 0% /sys/...roup
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: tmpfs 996M 0 996M 0% /tmp
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: /dev/sda2 477M 87M 365M 20% /boot
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: /dev/sda1 200M 9.4M 191M 5% /boot/efi
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: /dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
Jul 16 00:27:35 chicago-fw1 glustermount.sh[2464]: extra arguments at end (ignored)
Jul 16 00:27:36 chicago-fw1 glustermount.sh[2464]: / : ignored
Jul 16 00:27:36 chicago-fw1 glustermount.sh[2464]: /boot : already mounted
Jul 16 00:27:36 chicago-fw1 glustermount.sh[2464]: /boot/efi : already mounted
Jul 16 00:27:36 chicago-fw1 glustermount.sh[2464]: /gluster-fw1 : already mounted
Jul 16 00:27:36 chicago-fw1 glustermount.sh[2464]: swap : ignored
Jul 16 00:27:36 chicago-fw1 glustermount.sh[2464]: /firewall-scripts : successfully mounted
Jul 16 00:27:36 chicago-fw1 glustermount.sh[2464]: Mounted after mount -av
Jul 16 00:27:36 chicago-fw1 glustermount.sh[2464]: Filesystem Size Used Avail Use% Mounted on
Jul 16 00:27:36 chicago-fw1 systemd[1]: Started Set up Gluster mounts.
[***@chicago-fw1 system]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
[***@chicago-fw1 system]#
Finally, running the same glustermount.sh script by hand works just fine. It **only** breaks when run from systemd. Here is what happens when I run it by hand. Compare below to what happens above when I tell systemd to run it.

[***@chicago-fw1 system]#
[***@chicago-fw1 system]# /etc/rc.d/glustermount.sh
Tue Jul 16 00:23:15 CDT 2013
Sleeping 30 seconds.
Tue Jul 16 00:23:15 CDT 2013
Making sure the Gluster stuff is mounted
Mounted before mount -av
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
/ : ignored
/boot : already mounted
/boot/efi : already mounted
/gluster-fw1 : already mounted
swap : ignored
extra arguments at end (ignored)
/firewall-scripts : successfully mounted
Mounted after mount -av
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
localhost:/firewall-scripts 7.6G 19M 7.2G 1% /firewall-scripts
Starting up firewall common items
[***@chicago-fw1 system]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/fedora-root 14G 3.9G 8.7G 31% /
devtmpfs 990M 0 990M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 888K 996M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 0 996M 0% /tmp
/dev/sda2 477M 87M 365M 20% /boot
/dev/sda1 200M 9.4M 191M 5% /boot/efi
/dev/mapper/fedora-gluster--fw1 7.9G 33M 7.8G 1% /gluster-fw1
localhost:/firewall-scripts 7.6G 19M 7.2G 1% /firewall-scripts
[***@chicago-fw1 system]#

- Greg
Joe Julian
2013-07-16 06:08:26 UTC
Permalink
Post by Greg Scott
localhost:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0
Volume Name: firewall-scripts
Type: Replicate
Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2
Seems to me, my Gluster volume would still have some dependency on networking, even though I mount it with localhost in fstab, right?
In all our trial and error, the only thing that's worked so far is putting in that 30 second delay in my custom rc.local file. So instead of fighting this, why not try my hand at a systemd service file named, say, glustermount.service.
So here is the glustermount.service file I made. It runs a script named /etc/rc.d/glustermount.sh, pasted in below.
The only reason why not would be because it should be able to mount
during the normal netfs mount process. I have an idea how to fix that
properly. Your hack would certainly work though.
Greg Scott
2013-07-16 13:52:03 UTC
Permalink
Post by Joe Julian
Your hack would certainly work though.
Except that it didn't work.

I probably should have broken that monster post last night into smaller pieces. The summary is: my glustermount.sh script breaks when run from systemd, even though it claims success, yet it runs just fine by hand.

Details are a few bazillion lines down in the monster post I put together last night.

- Greg
Greg Scott
2013-07-16 14:31:21 UTC
Permalink
I'm the first to admit, I don't understand systemd yet. I wanted to know - how do you know what services you're starting up? What's the systemd equivalent of chkconfig --list? Well, maybe I found it - just type "systemctl" with no switches. So check out the differences between fw1 and fw2. I have no idea what this means but I think it might be relevant.
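For what it's worth, the closer equivalent of chkconfig --list is systemctl list-unit-files, which also appears in the hint at the bottom of the listing below:

systemctl list-unit-files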

First, systemctl on fw1. Notice everything is loaded and active except a mystery service named rngd.service, which I don't think we care about. I'll paste in all 118 lines from fw1. Now scroll down and look at the extract from fw2. So that nobody is forced to wade through 200+ lines of output, I'll only paste in the relevant unit that failed on fw2, named firewall\x2dscripts.mount.

This same service works on fw1, fails on fw2 with a status of "No such file or directory" - status posted below. I wonder what it does? That backslash character in a filename seems fishy. The only reference I can find anywhere is in a directory, /run/systemd/generator. I'll paste in what it looks like at the very bottom.

[***@chicago-fw1 rc.d]# systemctl
UNIT LOAD ACTIVE SUB DESCRIPTION
proc-sys-fs-binfmt_misc.automount loaded active waiting Arbitrary Executable File Formats File System Automo
sys-devices-pci0...-0000:00:1b.0-sound-card0.device loaded active plugged NM10/ICH7 Family High Definition Audio Controller
sys-devices-pci0...1-0000:02:00.0-net-enp2s0.device loaded active plugged RTL8111/8168 PCI Express Gigabit Ethernet controller
sys-devices-pci0...2-0000:03:00.0-net-enp3s0.device loaded active plugged RTL8111/8168 PCI Express Gigabit Ethernet controller
sys-devices-pci0...0-0000:05:04.0-net-enp5s4.device loaded active plugged RTL-8110SC/8169SC Gigabit Ethernet
sys-devices-pci0...0-0000:05:06.0-net-enp5s6.device loaded active plugged RTL-8110SC/8169SC Gigabit Ethernet
sys-devices-pci0...0-0000:05:07.0-net-enp5s7.device loaded active plugged RTL-8110SC/8169SC Gigabit Ethernet
sys-devices-pci0...:0-2:0:0:0-block-sda-sda1.device loaded active plugged SanDisk_SDSSDRC032G
sys-devices-pci0...:0-2:0:0:0-block-sda-sda2.device loaded active plugged SanDisk_SDSSDRC032G
sys-devices-pci0...:0-2:0:0:0-block-sda-sda3.device loaded active plugged SanDisk_SDSSDRC032G
sys-devices-pci0...et2:0:0-2:0:0:0-block-sda.device loaded active plugged SanDisk_SDSSDRC032G
sys-devices-platform-serial8250-tty-ttyS2.device loaded active plugged /sys/devices/platform/serial8250/tty/ttyS2
sys-devices-platform-serial8250-tty-ttyS3.device loaded active plugged /sys/devices/platform/serial8250/tty/ttyS3
sys-devices-pnp0-00:0a-tty-ttyS0.device loaded active plugged /sys/devices/pnp0/00:0a/tty/ttyS0
sys-devices-pnp0-00:0b-tty-ttyS1.device loaded active plugged /sys/devices/pnp0/00:0b/tty/ttyS1
sys-devices-virtual-block-dm\x2d0.device loaded active plugged /sys/devices/virtual/block/dm-0
sys-devices-virtual-block-dm\x2d1.device loaded active plugged /sys/devices/virtual/block/dm-1
sys-devices-virtual-block-dm\x2d2.device loaded active plugged /sys/devices/virtual/block/dm-2
sys-module-configfs.device loaded active plugged /sys/module/configfs
sys-module-fuse.device loaded active plugged /sys/module/fuse
sys-subsystem-net-devices-enp2s0.device loaded active plugged RTL8111/8168 PCI Express Gigabit Ethernet controller
sys-subsystem-net-devices-enp3s0.device loaded active plugged RTL8111/8168 PCI Express Gigabit Ethernet controller
sys-subsystem-net-devices-enp5s4.device loaded active plugged RTL-8110SC/8169SC Gigabit Ethernet
sys-subsystem-net-devices-enp5s6.device loaded active plugged RTL-8110SC/8169SC Gigabit Ethernet
sys-subsystem-net-devices-enp5s7.device loaded active plugged RTL-8110SC/8169SC Gigabit Ethernet
-.mount loaded active mounted /
boot-efi.mount loaded active mounted /boot/efi
boot.mount loaded active mounted /boot
dev-hugepages.mount loaded active mounted Huge Pages File System
dev-mqueue.mount loaded active mounted POSIX Message Queue File System
firewall\x2dscripts.mount loaded active mounted /firewall-scripts
gluster\x2dfw1.mount loaded active mounted /gluster-fw1
sys-fs-fuse-connections.mount loaded active mounted FUSE Control File System
sys-kernel-config.mount loaded active mounted Configuration File System
sys-kernel-debug.mount loaded active mounted Debug File System
tmp.mount loaded active mounted Temporary Directory
cups.path loaded active waiting CUPS Printer Service Spool
systemd-ask-password-plymouth.path loaded active waiting Forward Password Requests to Plymouth Directory Watc
systemd-ask-password-wall.path loaded active waiting Forward Password Requests to Wall Directory Watch
abrt-ccpp.service loaded active exited Install ABRT coredump hook
abrt-oops.service loaded active running ABRT kernel log watcher
abrt-xorg.service loaded active running ABRT Xorg log watcher
abrtd.service loaded active running ABRT Automated Bug Reporting Tool
alsa-state.service loaded active running Manage Sound Card State (restore and store)
atd.service loaded active running Job spooling tools
auditd.service loaded active running Security Auditing Service
avahi-daemon.service loaded active running Avahi mDNS/DNS-SD Stack
chronyd.service loaded active running NTP client/server
crond.service loaded active running Command Scheduler
dbus.service loaded active running D-Bus System Message Bus
fedora-loadmodules.service loaded active exited Load legacy module configuration
fedora-readonly.service loaded active exited Configure read-only root support
***@tty1.service loaded active running Getty on tty1
glusterd.service loaded active running GlusterFS an clustered file-system server
irqbalance.service loaded active running irqbalance daemon
lvm2-lvmetad.service loaded active running LVM2 metadata daemon
lvm2-monitor.service loaded active exited Monitoring of LVM2 mirrors, snapshots etc. using dme
mcelog.service loaded active running Machine Check Exception Logging Daemon
network.service loaded active exited LSB: Bring up/down networking
rc-local.service loaded active running /etc/rc.d/rc.local Compatibility
rngd.service loaded failed failed Hardware RNG Entropy Gatherer Daemon
rpcbind.service loaded active running RPC bind service
rsyslog.service loaded active running System Logging Service
smartd.service loaded active running Self Monitoring and Reporting Technology (SMART) Dae
sshd.service loaded active running OpenSSH server daemon
systemd-journald.service loaded active running Journal Service
systemd-logind.service loaded active running Login Service
systemd-readahead-collect.service loaded active exited Collect Read-Ahead Data
systemd-readahead-replay.service loaded active exited Replay Read-Ahead Data
systemd-remount-fs.service loaded active exited Remount Root and Kernel File Systems
systemd-sysctl.service loaded active exited Apply Kernel Variables
systemd-tmpfiles-setup.service loaded active exited Recreate Volatile Files and Directories
systemd-udev-trigger.service loaded active exited udev Coldplug all Devices
systemd-udevd.service loaded active running udev Kernel Device Manager
systemd-user-sessions.service loaded active exited Permit User Sessions
systemd-vconsole-setup.service loaded active exited Setup Virtual Console
vsftpd.service loaded active running Vsftpd ftp daemon
avahi-daemon.socket loaded active listening Avahi mDNS/DNS-SD Stack Activation Socket
cups.socket loaded active listening CUPS Printing Service Sockets
dbus.socket loaded active running D-Bus System Message Bus Socket
dm-event.socket loaded active listening Device-mapper event daemon FIFOs
iscsid.socket loaded active listening Open-iSCSI iscsid Socket
iscsiuio.socket loaded active listening Open-iSCSI iscsiuio Socket
lvm2-lvmetad.socket loaded active running LVM2 metadata daemon socket
pcscd.socket loaded active listening PC/SC Smart Card Daemon Activation Socket
rpcbind.socket loaded active listening RPCbind Server Activation Socket
syslog.socket loaded active running Syslog Socket
systemd-initctl.socket loaded active listening /dev/initctl Compatibility Named Pipe
systemd-journald.socket loaded active running Journal Socket
systemd-shutdownd.socket loaded active listening Delayed Shutdown Socket
systemd-udevd-control.socket loaded active listening udev Control Socket
systemd-udevd-kernel.socket loaded active running udev Kernel Socket
dev-dm\x2d1.swap loaded active active /dev/dm-1
basic.target loaded active active Basic System
cryptsetup.target loaded active active Encrypted Volumes
getty.target loaded active active Login Prompts
local-fs-pre.target loaded active active Local File Systems (Pre)
local-fs.target loaded active active Local File Systems
multi-user.target loaded active active Multi-User System
network-online.target loaded active active Network is Online
network.target loaded active active Network
paths.target loaded active active Paths
remote-fs.target loaded active active Remote File Systems
sockets.target loaded active active Sockets
sound.target loaded active active Sound Card
swap.target loaded active active Swap
sysinit.target loaded active active System Initialization
timers.target loaded active active Timers
systemd-readahead-done.timer loaded active elapsed Stop Read-Ahead Data Collection 10s After Completed
systemd-tmpfiles-clean.timer loaded active waiting Daily Cleanup of Temporary Directories

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

110 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
lines 94-118/118 (END)

To save a headache, here is just the extract from fw2:

[***@chicago-fw2 ~]# systemctl
UNIT LOAD ACTIVE SUB DESCRIPTION
.
.
.
firewall\x2dscripts.mount loaded failed failed /firewall-scripts
.
.
.
[***@chicago-fw2 rc.d]# systemctl status firewall\x2dscripts.mount -n 50
firewallx2dscripts.mount
Loaded: error (Reason: No such file or directory)
Active: inactive (dead)

And here is what that mystery unit file looks like. Is my dash, "-", some kind of illegal character in fstab?

::::::::::::::
firewall\x2dscripts.mount
::::::::::::::
# Automatically generated by systemd-fstab-generator

[Unit]
SourcePath=/etc/fstab
DefaultDependencies=no
After=remote-fs-pre.target
After=network.target
After=network-online.target
Wants=network-online.target
Conflicts=umount.target
Before=umount.target
Before=remote-fs.target

[Mount]
What=localhost:/firewall-scripts
Where=/firewall-scripts
Type=glusterfs
FsckPassNo=0
Options=defaults,_netdev
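For what it's worth, the \x2d is just how systemd escapes a "-" in unit names generated from paths (because "-" is reserved as the path separator in unit names), so the dash in fstab is fine. The backslash does need to be quoted when the unit name is typed at a shell, or the shell strips it - which is likely why the status command above reported "No such file or directory". For example:

# Quote the escaped unit name so the shell keeps the backslash:
systemctl status 'firewall\x2dscripts.mount' -n 50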
Greg Scott
2013-07-16 14:47:22 UTC
Permalink
Sorry everyone, that whole path of looking at units with systemctl is a dead end. Doing mount -av by hand makes that firewall\x2dscripts.mount unit loaded and active. I've been mounting and unmounting on fw1, and that's why the unit was active on fw1 and failed on fw2.

- Greg
raghav
2013-07-10 10:17:07 UTC
Permalink
I don't get this. I have a replicated volume and 2 nodes. My
challenge is, when I take one node offline, the other node can no
longer access the volume until both nodes are back online again.
I have 2 nodes, fw1 and fw2. Each node has an XFS file system,
/gluster-fw1 on node fw1 and /gluster-fw2 on node fw2. Node fw1 is at
IP Address 192.168.253.1. Node fw2 is at 192.168.253.2.
I create a gluster volume named firewall-scripts which is a replica of
those two XFS file systems. The volume holds a bunch of config files
common to both fw1 and fw2. The application is an active/standby pair
of firewalls and the idea is to keep config files in a gluster volume.
When both nodes are online, everything works as expected. But when I
ls: cannot access /firewall-scripts: Transport endpoint is not connected
And when I bring the offline node back online, node fw2 eventually behaves normally again.
What's up with that? Gluster is supposed to be resilient and
self-healing and able to stand up to this sort of abuse. So I must be
doing something wrong.
Here is how I set up everything -- it doesn't get much simpler than
this and my setup is right out the Getting Started Guide but using my
own names.
gluster peer probe 192.168.253.2
gluster peer status
gluster volume create firewall-scripts replica 2 transport tcp
192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
gluster volume start firewall-scripts
mkdir /firewall-scripts
mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts
192.168.253.1:/firewall-scripts /firewall-scripts glusterfs
defaults,_netdev 0 0
mkdir /firewall-scripts
mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts
192.168.253.2:/firewall-scripts /firewall-scripts glusterfs
defaults,_netdev 0 0
That's it. That's the whole setup. When both nodes are online,
everything replicates beautifully. But take one node offline and it
all falls apart.
Volume Name: firewall-scripts
Type: Replicate
Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2
Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see
[2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init]
0-firewall-scripts-replicate-0: no subvolumes up
[2013-07-09 00:59:04.706515] W [fuse-bridge.c:1132:fuse_err_cbk]
0-glusterfs-fuse: 3160: FLUSH() ERR => -1 (Transport endpoint is not
connected)
[2013-07-09 01:01:35.006782] I [rpc-clnt.c:1648:rpc_clnt_reconfig]
0-firewall-scripts-client-0: changing port to 49152 (from 0)
[2013-07-09 01:01:35.006932] W [socket.c:514:__socket_rwv]
0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-09 01:01:35.018546] I
[client-handshake.c:1658:select_server_supported_programs]
0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num
(1298437), Version (330)
[2013-07-09 01:01:35.019273] I
[client-handshake.c:1456:client_setvolume_cbk]
0-firewall-scripts-client-0: Connected to 192.168.253.1:49152,
attached to remote volume '/gluster-fw1'.
[2013-07-09 01:01:35.019356] I
[client-handshake.c:1468:client_setvolume_cbk]
0-firewall-scripts-client-0: Server and Client lk-version numbers are
not same, reopening the fds
[2013-07-09 01:01:35.019441] I
[client-handshake.c:1308:client_post_handshake]
0-firewall-scripts-client-0: 1 fds open - Delaying child_up until they
are re-opened
[2013-07-09 01:01:35.020070] I
[client-handshake.c:930:client_child_up_reopen_done]
0-firewall-scripts-client-0: last fd open'd/lock-self-heal'd -
notifying CHILD-UP
[2013-07-09 01:01:35.020282] I [afr-common.c:3698:afr_notify]
0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0'
came back up; going online.
[2013-07-09 01:01:35.020616] I
[client-handshake.c:450:client_set_lk_version_cbk]
0-firewall-scripts-client-0: Server lk version = 1
So how do I make glusterfs survive a node failure, which is the whole point of all this?
It looks like the brick processes on the fw2 machine are not running, and
hence when fw1 is down the entire replication process is stalled. Can you
do a ps to get the status of all the gluster processes and ensure that
the brick process is up on fw2?
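A quick way to check that (a sketch; glusterd is the management daemon, glusterfsd the per-brick server process, and glusterfs the FUSE client and helper daemons):

# Is the brick server process running on this node?
ps ax | grep '[g]lusterfsd'
# What does gluster itself report for the bricks?
gluster volume status firewall-scripts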

Regards
Raghav
Greg Scott
2013-07-10 21:57:16 UTC
Permalink
It looks like the brick processes on the fw2 machine are not running, and hence when fw1 is down
the entire replication process is stalled. Can you do a ps to get the status of all the gluster processes
and ensure that the brick process is up on fw2?
I was away from this most of the day. Here is a ps ax | grep gluster from both fw1 and fw2 while both nodes are online.
Greg Scott
2013-07-10 22:04:54 UTC
Permalink
And here is ps ax | grep gluster from both nodes when fw1 is offline. Note I have it mounted right now with the 'backupvolfile-server=<secondary server>' mount option. The ps ax | grep gluster output looks the same now as it did when both nodes were online.
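For reference, that option can also go straight into fstab alongside _netdev; using the addresses from this thread, the fw1 entry would look something like:

192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev,backupvolfile-server=192.168.253.2 0 0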