Discussion:
Gluster failure due to "0-management: Lock not released for <volumename>"
Victor Nomura
2017-06-21 18:10:52 UTC
Hi All,



I'm fairly new to Gluster (3.10.3) and had it running for a couple of months,
but after a power failure in our building it all came crashing down. No client
is able to connect after powering the 3 nodes in my setup back on.



Looking at the logs, it appears there's some sort of "lock" placed on the
volume which prevents all the clients from connecting to the Gluster
endpoint.



I can't even run a "gluster volume status all" command if more than one node
is powered up. I have to shut down nodes 2 and 3 before I can issue the
command on node 1 and see the volume status. When all nodes are powered up and
I check the peer status, it says that all peers are connected. Trying to
connect to the Gluster volume from any client reports that the Gluster
endpoint is not available and times out. There are no network issues: the
nodes can all ping each other, and there are no firewalls or other devices
between the nodes and the clients.
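
(For reference, the checks described above amount to roughly the following;
the gluster CLI commands are standard, and the host names are the ones from
the volume info below:)

  gluster peer status          # reports every peer as connected once all nodes are up
  gluster volume status all    # only completes when the other two nodes are powered off
  ping -c 3 gfsnode2           # plain reachability between the nodes works fine
  ping -c 3 gfsnode3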



Please help if you think you know how to fix this. I have a feeling it's
this "lock" that's not "released" due to the whole setup losing power all of
a sudden. I've tried restarting all the nodes, restarting glusterfs-server
etc. I'm out of ideas.



Thanks in advance!



Victor



Volume Name: teravolume

Type: Distributed-Replicate

Volume ID: 85af74d0-f1bc-4b0d-8901-4dea6e4efae5

Status: Started

Snapshot Count: 0

Number of Bricks: 3 x 2 = 6

Transport-type: tcp

Bricks:

Brick1: gfsnode1:/media/brick1

Brick2: gfsnode2:/media/brick1

Brick3: gfsnode3:/media/brick1

Brick4: gfsnode1:/media/brick2

Brick5: gfsnode2:/media/brick2

Brick6: gfsnode3:/media/brick2

Options Reconfigured:

nfs.disable: on





[2017-06-21 16:02:52.376709] W [MSGID: 106118]
[glusterd-handler.c:5913:__glusterd_peer_rpc_notify] 0-management: Lock not
released for teravolume

[2017-06-21 16:03:03.429032] I [MSGID: 106163]
[glusterd-handshake.c:1309:__glusterd_mgmt_hndsk_versions_ack] 0-management:
using the op-version 31000

[2017-06-21 16:13:13.326478] E [rpc-clnt.c:200:call_bail] 0-management:
bailing out frame type(Peer mgmt) op(--(2)) xid = 0x105 sent = 2017-06-21
16:03:03.202284. timeout = 600 for 192.168.150.52:$

[2017-06-21 16:13:13.326519] E [rpc-clnt.c:200:call_bail] 0-management:
bailing out frame type(Peer mgmt) op(--(2)) xid = 0x105 sent = 2017-06-21
16:03:03.204555. timeout = 600 for 192.168.150.53:$

[2017-06-21 16:18:34.456522] I [MSGID: 106004]
[glusterd-handler.c:5888:__glusterd_peer_rpc_notify] 0-management: Peer
<gfsnode2> (<e1e1caa5-9842-40d8-8492-a82b079879a3>), in state <Peer in
Cluste$

[2017-06-21 16:18:34.456619] W
[glusterd-locks.c:675:glusterd_mgmt_v3_unlock]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.10.3/xlator/mgmt/glusterd.so(+0x1f
879) [0x7fee6bc22879] -->/usr/lib/x86_64-l$

[2017-06-21 16:18:34.456638] W [MSGID: 106118]
[glusterd-handler.c:5913:__glusterd_peer_rpc_notify] 0-management: Lock not
released for teravolume

[2017-06-21 16:18:34.456661] I [MSGID: 106004]
[glusterd-handler.c:5888:__glusterd_peer_rpc_notify] 0-management: Peer
<gfsnode3> (<59b9effa-2b88-4764-9130-4f31c14c362e>), in state <Peer in
Cluste$

[2017-06-21 16:18:34.456692] W
[glusterd-locks.c:675:glusterd_mgmt_v3_unlock]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.10.3/xlator/mgmt/glusterd.so(+0x1f
879) [0x7fee6bc22879] -->/usr/lib/x86_64-l$

[2017-06-21 16:18:43.323944] I [MSGID: 106163]
[glusterd-handshake.c:1309:__glusterd_mgmt_hndsk_versions_ack] 0-management:
using the op-version 31000

[2017-06-21 16:18:34.456699] W [MSGID: 106118]
[glusterd-handler.c:5913:__glusterd_peer_rpc_notify] 0-management: Lock not
released for teravolume

[2017-06-21 16:18:45.628552] I [MSGID: 106163]
[glusterd-handshake.c:1309:__glusterd_mgmt_hndsk_versions_ack] 0-management:
using the op-version 31000

[2017-06-21 16:23:40.607173] I [MSGID: 106499]
[glusterd-handler.c:4363:__glusterd_handle_status_volume] 0-management:
Received status volume req for volume teravolume
Atin Mukherjee
2017-06-22 16:00:20 UTC
Could you attach glusterd.log and cmd_history.log files from all the nodes?
Atin Mukherjee
2017-06-27 07:28:43 UTC
I had a look at the logs shared by Victor privately, and it seems there is a
N/W glitch in the cluster which is causing glusterd to lose its connection
with the other peers. As a side effect, a lot of RPC requests are getting
bailed out, leaving glusterd stuck on a stale lock, which is why some of the
commands fail with "another transaction is in progress" or "locking failed."

Some examples of the symptom highlighted:

[2017-06-21 23:02:03.826858] E [rpc-clnt.c:200:call_bail] 0-management:
bailing out frame type(Peer mgmt) op(--(2)) xid = 0x4 sent = 2017-06-21
22:52:02.719068. timeout = 600 for 192.168.150.53:24007
[2017-06-21 23:02:03.826888] E [rpc-clnt.c:200:call_bail] 0-management:
bailing out frame type(Peer mgmt) op(--(2)) xid = 0x4 sent = 2017-06-21
22:52:02.716782. timeout = 600 for 192.168.150.52:24007
[2017-06-21 23:02:53.836936] E [rpc-clnt.c:200:call_bail] 0-management:
bailing out frame type(glusterd mgmt v3) op(--(1)) xid = 0x5 sent =
2017-06-21 22:52:47.909169. timeout = 600 for 192.168.150.53:24007
[2017-06-21 23:02:53.836991] E [MSGID: 106116]
[glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Locking
failed on gfsnode3. Please check log file for details.
[2017-06-21 23:02:53.837016] E [rpc-clnt.c:200:call_bail] 0-management:
bailing out frame type(glusterd mgmt v3) op(--(1)) xid = 0x5 sent =
2017-06-21 22:52:47.909175. timeout = 600 for 192.168.150.52:24007

I'd like to request that you first look at the N/W layer and rectify the
problems there.
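
A minimal sketch of such network-layer checks, run from each node against its
peers (the host names are the ones used in this thread; 24007 is glusterd's
standard management port):

  for host in gfsnode1 gfsnode2 gfsnode3; do
      ping -c 3 "$host"                # basic reachability
      nc -zv -w 5 "$host" 24007        # can we reach the remote glusterd?
  done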
Victor Nomura
2017-06-29 17:20:52 UTC
Thanks for the reply. What would be the best course of action? The data on the volume isn't important right now, but I'm worried about hitting the same situation once our setup goes to production, when we would really need to be able to recover our Gluster setup.



I'm assuming that to redo everything I would delete the contents of the /var/lib/glusterd directory on each of the nodes and recreate the volume, essentially starting over. If I leave the mount points the same and keep the data and brick setup intact, will the files still be there and accessible afterwards? (I would not delete the data on the bricks.)
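
(A heavily hedged sketch of the commonly documented recipe for recreating a
volume on bricks that already hold data, in case it helps; none of this is
confirmed in this thread, so back everything up first. Gluster refuses to
reuse a brick until its old volume markers are cleared:)

  # on each node, for every brick path, e.g. /media/brick1 and /media/brick2
  setfattr -x trusted.glusterfs.volume-id /media/brick1
  setfattr -x trusted.gfid /media/brick1
  rm -rf /media/brick1/.glusterfs
  # then recreate the volume with the same brick order as before
  gluster volume create teravolume replica 2 \
      gfsnode1:/media/brick1 gfsnode2:/media/brick1 gfsnode3:/media/brick1 \
      gfsnode1:/media/brick2 gfsnode2:/media/brick2 gfsnode3:/media/brick2 force

Whether the existing files then show up cleanly through a client mount is not
something this thread answers.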



Regards,



Victor Nomura



Atin Mukherjee
2017-06-30 10:40:24 UTC
I don't think there is anything wrong at the Gluster stack. If you cross-check
the n/w layer and make sure it's up all the time, then restarting glusterd on
all the nodes should resolve the stale locks.
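
(A sketch of that suggestion, assuming systemd and the Debian/Ubuntu packaging
mentioned earlier in the thread, where the service is called glusterfs-server:)

  # run on each node, one at a time, once the network is known to be stable
  sudo systemctl restart glusterfs-server
  sudo gluster peer status               # every peer should show "Peer in Cluster (Connected)"
  sudo gluster volume status teravolume  # should complete without timing out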
--
- Atin (atinm)
Victor Nomura
2017-07-04 16:40:43 UTC
The nodes have all been rebooted numerous times with no difference in outcome. The nodes are all connected to the same switch, and I also replaced the switch to see if it made any difference.



There are no issues with network connectivity and no firewall in place between the nodes.



I can't do a gluster volume status without it timing out the moment the other 2 nodes are connected to the switch, which is odd. With one node turned on and the others off, I can perform some volume commands, but the moment any of the other nodes connects, a lot of commands just time out. There's no IP address conflict or anything of that nature either.



Nothing seems to resolve the locks. Is there a manual way to release them?



Regards,



Victor









Victor Nomura
2017-07-04 18:25:52 UTC
Specifically, I must stop the glusterfs-server service on the other nodes in order to perform any gluster command on any node.



Atin Mukherjee
2017-07-05 05:06:48 UTC
By any chance do you have any redundant peer entries in the
/var/lib/glusterd/peers directory? Can you please share the contents of this
folder from all the nodes?
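
(A quick way to collect what is being asked for here; /var/lib/glusterd is the
standard glusterd state directory:)

  # run on each node and compare the output across nodes
  hostname
  ls -l /var/lib/glusterd/peers/
  grep -H . /var/lib/glusterd/peers/*      # each file should describe exactly one real peer
  gluster peer status                      # should list only the other two nodes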
--
- Atin (atinm)
Victor Nomura
2017-07-07 21:37:07 UTC
It’s working again! After countless hours trying to get it fixed, I just redid everything and tested to see what caused Gluster to fail.



The problem went away and there are no more locks after I disabled jumbo frames and changed the MTU back to 1500. With the MTU set to 9000, Gluster was dead. Why? All other networking functions were fine.
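
(For anyone hitting the same symptom, a rough way to confirm an MTU/jumbo-frame
mismatch like this one; the interface name eth0 is an assumption, and 8972 is
9000 minus the 28 bytes of IP/ICMP headers:)

  ip link show eth0 | grep -o 'mtu [0-9]*'   # check the configured MTU on every node
  ping -M do -c 3 -s 8972 gfsnode2           # don't-fragment ping at jumbo size; fails if
                                             # any hop on the path is not really 9000 MTU
  ping -M do -c 3 -s 1472 gfsnode2           # the 1500-MTU equivalent should always work
  sudo ip link set dev eth0 mtu 1500         # drop back to 1500, as done above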



Regards,



Victor Nomura




