Discussion:
[Gluster-users] Fixing a rejected peer
Jamie Lawrence
2018-03-06 00:30:10 UTC
Hello,

So I'm seeing a rejected peer with 3.12.6. This is with a replica 3 volume.

It actually began as the same problem with a different peer, (call it) gluster-2; I noticed when I couldn't create a new volume. I compared /var/lib/glusterd between the peers and found that somehow the options in one of the vols differed. (I suspect this was due to attempting to create the volume via the Ovirt GUI; suffice to say I'm not using it for things like this in the future.) So I stopped the daemons and corrected that (gluster-2 had a tiering entry the others didn't).
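(For anyone reproducing this: something like the following is enough to spot drift between two peers' copies of the store - the hostname here is just an example.)

# mkdir /tmp/gluster-2
# ssh gluster-2 'tar -C /var/lib/glusterd -cf - vols' | tar -C /tmp/gluster-2 -xf -
# diff -r /var/lib/glusterd/vols /tmp/gluster-2/vols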

Started things back up and now gluster-3 is being rejected by the other two. The error is below.

I'm tempted to repeat - down things, copy the checksum the "good" ones agree on, start things; but given that this has turned into a balloon-squeezing exercise, I want to make sure I'm not doing this the wrong way.

What is the currently accepted best method for fixing this?

And given that this happened on a nearly brand-new deployment, it worries me a bit that this happened while nothing hinky was going on - I installed Gluster manually, but the rest of the systems management has been via Ovirt. Has anyone else seen issues with this?

Thanks,

-j

- - snip - -

[2018-03-06 00:14:06.141281] I [MSGID: 106490] [glusterd-handler.c:2891:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c
[2018-03-06 00:14:06.145540] I [MSGID: 106493] [glusterd-handler.c:2954:__glusterd_handle_probe_query] 0-glusterd: Responded to sc5-gluster-1, op_ret: 0, op_errno: 0, ret: 0
[2018-03-06 00:14:06.145697] I [MSGID: 106490] [glusterd-handler.c:2891:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a
[2018-03-06 00:14:06.145831] I [MSGID: 106493] [glusterd-handler.c:2954:__glusterd_handle_probe_query] 0-glusterd: Responded to sc5-gluster-10g-2, op_ret: 0, op_errno: 0, ret: 0
[2018-03-06 00:14:06.149357] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a
[2018-03-06 00:14:06.149631] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 53769889, remote cksum = 2068896937 on peer sc5-gluster-2
[2018-03-06 00:14:06.149774] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-2 (0), ret: 0, op_ret: -1
[2018-03-06 00:14:06.151393] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c
[2018-03-06 00:14:06.152127] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 53769889, remote cksum = 2068896937 on peer sc5-gluster-10g-1
[2018-03-06 00:14:06.152314] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-10g-1 (0), ret: 0, op_ret: -1
[2018-03-06 00:14:06.164819] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /gluster-bricks/sc5_ovirt_engine/sc5_ovirt_engine on port 49152
[2018-03-06 00:14:06.443882] I [MSGID: 106487] [glusterd-handler.c:1485:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
Atin Mukherjee
2018-03-06 02:41:48 UTC
Post by Jamie Lawrence
Hello,
So I'm seeing a rejected peer with 3.12.6. This is with a replica 3 volume.
It actually began as the same problem with a different peer, (call it)
gluster-2; I noticed when I couldn't create a new volume. I compared
/var/lib/glusterd between the peers and found that somehow the options
in one of the vols differed. (I suspect this was due to attempting to
create the volume via the Ovirt GUI; suffice to say I'm not using it for
things like this in the future.) So I stopped the daemons and corrected
that (gluster-2 had a tiering entry the others didn't).
When you say the others didn't, how many peers are you talking about? Are
they all running 3.12.6? We had a bug,
https://bugzilla.redhat.com/show_bug.cgi?id=1544637, which could lead you
to such situations, but it has been fixed in 3.12.6. So if all of the
nodes are running the same version, i.e. 3.12.6, and the
cluster.op-version is set to latest, then ideally you shouldn't see this
problem. Could you clarify?
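You can confirm both on each node with:

# gluster --version
# gluster v get all cluster.op-version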
Post by Jamie Lawrence
Started things back up and now gluster-3 is being rejected by the other
two. The error is below.
I'm tempted to repeat - down things, copy the checksum the "good" ones
agree on, start things; but given that this has turned into a
balloon-squeezing exercise, I want to make sure I'm not doing this the
wrong way.
Yes, that's the way. Copy /var/lib/glusterd/vols/<volname>/ from the good
node to the rejected one and restart the glusterd service on the rejected
peer.
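Roughly, on the rejected peer, assuming glusterd runs under systemd
('good-node' and <volname> are placeholders):

# systemctl stop glusterd
# rsync -a --delete good-node:/var/lib/glusterd/vols/<volname>/ /var/lib/glusterd/vols/<volname>/
# systemctl start glusterd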
Post by Jamie Lawrence
What is the currently accepted best method for fixing this?
And given that this happened on a nearly brand-new deployment, it worries
me a bit that this happened while nothing hinky was going on - I installed
Gluster manually, but the rest of the systems management has been via
Ovirt. Has anyone else seen issues with this?
Thanks,
-j
[2018-03-06 00:14:06.141281] I [MSGID: 106490] [glusterd-handler.c:2891:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c
[2018-03-06 00:14:06.145540] I [MSGID: 106493] [glusterd-handler.c:2954:__glusterd_handle_probe_query] 0-glusterd: Responded to sc5-gluster-1, op_ret: 0, op_errno: 0, ret: 0
[2018-03-06 00:14:06.145697] I [MSGID: 106490] [glusterd-handler.c:2891:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a
[2018-03-06 00:14:06.145831] I [MSGID: 106493] [glusterd-handler.c:2954:__glusterd_handle_probe_query] 0-glusterd: Responded to sc5-gluster-10g-2, op_ret: 0, op_errno: 0, ret: 0
[2018-03-06 00:14:06.149357] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a
[2018-03-06 00:14:06.149631] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 53769889, remote cksum = 2068896937 on peer sc5-gluster-2
[2018-03-06 00:14:06.149774] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-2 (0), ret: 0, op_ret: -1
[2018-03-06 00:14:06.151393] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c
[2018-03-06 00:14:06.152127] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 53769889, remote cksum = 2068896937 on peer sc5-gluster-10g-1
[2018-03-06 00:14:06.152314] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-10g-1 (0), ret: 0, op_ret: -1
[2018-03-06 00:14:06.164819] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /gluster-bricks/sc5_ovirt_engine/sc5_ovirt_engine on port 49152
[2018-03-06 00:14:06.443882] I [MSGID: 106487] [glusterd-handler.c:1485:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
Jamie Lawrence
2018-03-06 18:50:13 UTC
Post by Jamie Lawrence
Hello,
So I'm seeing a rejected peer with 3.12.6. This is with a replica 3 volume.
It actually began as the same problem with a different peer, (call it) gluster-2; I noticed when I couldn't create a new volume. I compared /var/lib/glusterd between the peers and found that somehow the options in one of the vols differed. (I suspect this was due to attempting to create the volume via the Ovirt GUI; suffice to say I'm not using it for things like this in the future.) So I stopped the daemons and corrected that (gluster-2 had a tiering entry the others didn't).
Post by Atin Mukherjee
When you say the others didn't, how many peers are you talking about? Are they all running 3.12.6? We had a bug, https://bugzilla.redhat.com/show_bug.cgi?id=1544637, which could lead you to such situations, but it has been fixed in 3.12.6. So if all of the nodes are running the same version, i.e. 3.12.6, and the cluster.op-version is set to latest, then ideally you shouldn't see this problem. Could you clarify?
They all run 3.12.6; there are currently 3 peers total.

So, cluster.op-version is 30800. I was previously unaware of the distinction, but looking at the `info` file, the client op-version for the volume is 30712. Does that matter?
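(Those numbers come straight out of the volume's info file, e.g.:

# grep op-version /var/lib/glusterd/vols/sc5-ovirt_engine/info
op-version=30712
client-op-version=30712
)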

That bug does look like what happened, though.
Post by Jamie Lawrence
Started things back up and now gluster-3 is being rejected by the other two. The error is below.
I'm tempted to repeat - down things, copy the checksum the "good" ones agree on, start things; but given that this has turned into a balloon-squeezing exercise, I want to make sure I'm not doing this the wrong way.
Post by Atin Mukherjee
Yes, that's the way. Copy /var/lib/glusterd/vols/<volname>/ from the good node to the rejected one and restart the glusterd service on the rejected peer.
So I did this, and it immediately went back to the rejected state. The `cksum` file diverged right away.

-j
Jamie Lawrence
2018-03-06 19:48:10 UTC
Just following up on the below after having some time to track down the differences.

On the bad peer, the `tier-enabled=0` line in .../vols/<volname>/info was removed after I copied it over, and, as mentioned, the cksum file changed to a value that doesn't match the others'. The logs only complain about the cksum (appended below).
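(The divergence is easy to see by comparing glusterd's stored checksum on each peer:

# cat /var/lib/glusterd/vols/sc5-ovirt_engine/cksum
)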

I haven't done anything with tiering; I suppose it is possible that Ovirt did something goofy when I tried using it, but I am very confused by this whack-a-mole game, and don't know how to resolve it.

-j


- - - -
[...]
Post by Jamie Lawrence
Started things back up and now gluster-3 is being rejected by the other two. The error is below.
I'm tempted to repeat - down things, copy the checksum the "good" ones agree on, start things; but given that this has turned into a balloon-squeezing exercise, I want to make sure I'm not doing this the wrong way.
Post by Atin Mukherjee
Yes, that's the way. Copy /var/lib/glusterd/vols/<volname>/ from the good node to the rejected one and restart the glusterd service on the rejected peer.
So I did this, and it immediately went back to the rejected state. The `cksum` file diverged right away.



- - - - - -
[2018-03-06 18:31:25.380546] I [MSGID: 106005] [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management: Brick 172.16.0.153:/gluster-bricks/sc5_ovirt_engine/sc5_ovirt_engine has disconnected from glusterd.
[2018-03-06 18:31:25.380913] I [MSGID: 106490] [glusterd-handler.c:2891:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c
[2018-03-06 18:31:25.384259] I [MSGID: 106493] [glusterd-handler.c:2954:__glusterd_handle_probe_query] 0-glusterd: Responded to sc5-gluster-1.squaretrade.com, op_ret: 0, op_errno: 0, ret: 0
[2018-03-06 18:31:25.384411] I [MSGID: 106490] [glusterd-handler.c:2891:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a
[2018-03-06 18:31:25.384541] I [MSGID: 106493] [glusterd-handler.c:2954:__glusterd_handle_probe_query] 0-glusterd: Responded to sc5-gluster-10g-2, op_ret: 0, op_errno: 0, ret: 0
[2018-03-06 18:31:25.388144] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c
[2018-03-06 18:31:25.388795] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 53769889, remote cksum = 2068896937 on peer sc5-gluster-10g-1.squaretrade.com
[2018-03-06 18:31:25.388978] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-10g-1.squaretrade.com (0), ret: 0, op_ret: -1
[2018-03-06 18:31:25.390976] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a
[2018-03-06 18:31:25.391241] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 53769889, remote cksum = 2068896937 on peer sc5-gluster-2.squaretrade.com
[2018-03-06 18:31:25.391390] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-2.squaretrade.com (0), ret: 0, op_ret: -1
[2018-03-06 18:31:25.402669] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /gluster-bricks/sc5_ovirt_engine/sc5_ovirt_engine on port 49152
[2018-03-06 18:31:37.422140] I [MSGID: 106487] [glusterd-handler.c:1485:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-03-06 18:32:06.551544] I [MSGID: 106487] [glusterd-handler.c:1485:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-03-06 18:32:32.414663] I [MSGID: 106487] [glusterd-handler.c:1243:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req sc5-gluster-10g-2 24007
[2018-03-06 18:32:36.957054] I [MSGID: 106487] [glusterd-handler.c:1485:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-03-06 18:32:43.067011] I [MSGID: 106499] [glusterd-handler.c:4303:__glusterd_handle_status_volume] 0-management: Received status volume req for volume sc5-ovirt_engine
[2018-03-06 18:33:07.020062] I [MSGID: 106487] [glusterd-handler.c:1485:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-03-06 18:33:36.435916] I [MSGID: 106487] [glusterd-handler.c:1485:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2018-03-06 18:34:26.754494] I [MSGID: 106499] [glusterd-handler.c:4303:__glusterd_handle_status_volume] 0-management: Received status volume req for volume sc5-ovirt_engine
[2018-03-06 18:35:05.206520] I [MSGID: 106488] [glusterd-handler.c:1548:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-03-06 18:35:05.524085] I [MSGID: 106499] [glusterd-handler.c:4303:__glusterd_handle_status_volume] 0-management: Received status volume req for volume sc5-ovirt_engine
The message "I [MSGID: 106487] [glusterd-handler.c:1485:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req" repeated 4 times between [2018-03-06 18:33:36.435916] and [2018-03-06 18:35:06.421623]
Jamie Lawrence
2018-03-06 22:43:56 UTC
Post by Jamie Lawrence
I'm tempted to repeat - down things, copy the checksum the "good" ones agree on, start things; but given that this has turned into a balloon-squeezing exercise, I want to make sure I'm not doing this the wrong way.
Yes, that's the way. Copy /var/lib/glusterd/vols/<volname>/ from the good node to the rejected one and restart glusterd service on the rejected peer.
My apologies for the multiple messages - I'm having to work on this episodically.

I've tried again to reset state on the bad peer, to no avail. This time I downed all of the peers, copied things over (ensuring that the tier-enabled line was absent), and started back up; the cksum immediately changed to a bad value, the two good nodes added that line back in, and the bad node didn't have it.

Just to have a clear view of this, I did it yet again, this time ensuring the tier-enabled line was present everywhere. Same result, except that it didn't add the tier-enabled line, which I suppose makes some sense.

One oddity - I see:

# gluster v get all cluster.op-version
Option Value
------ -----
cluster.op-version 30800

but from one of the `info` files:

op-version=30712
client-op-version=30712

I don't know what it means that the cluster is at one version but apparently the volume is set for another - I thought that was a cluster-level setting. (Client.op-version theoretically makes more sense - I can see Ovirt wanting an older version.)

I'm at a loss to fix this - copying /var/lib/glusterd/vols/<volname> over doesn't fix the problem. I'd be somewhat OK with trashing the volume and starting over, if it weren't for two things: (1) Ovirt was also a massive pain to set up, and it is configured on this volume; and (2), perhaps more importantly, I'm concerned about this happening again once this is in production, which would be Bad, especially if I don't have a fix.

So at this point, I'm unclear on how to move forward or even where more to look for potential problems.

-j

- - - -

[2018-03-06 22:30:32.421530] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c
[2018-03-06 22:30:32.422582] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 3949237931, remote cksum = 2068896937 on peer sc5-gluster-10g-1.squaretrade.com
[2018-03-06 22:30:32.422774] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-10g-1.squaretrade.com (0), ret: 0, op_ret: -1
[2018-03-06 22:30:32.424621] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c, host: sc5-gluster-10g-1.squaretrade.com, port: 0
[2018-03-06 22:30:32.425563] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a, host: sc5-gluster-2.squaretrade.com, port: 0
[2018-03-06 22:30:32.426706] I [MSGID: 106163] [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
[2018-03-06 22:30:32.428075] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a
[2018-03-06 22:30:32.428325] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 3949237931, remote cksum = 2068896937 on peer sc5-gluster-2.squaretrade.com
[2018-03-06 22:30:32.428468] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-2.squaretrade.com (0), ret: 0, op_ret: -1
Atin Mukherjee
2018-03-07 12:39:49 UTC
Please run 'gluster v get all cluster.max-op-version' and whatever value
it throws up should be used to bump up the cluster.op-version (gluster v
set all cluster.op-version <value>). With that, if you restart the
rejected peer, I believe the problem should go away; if it doesn't, I'd
need to investigate further once you pass along the glusterd and
cmd_history log files and the content of /var/lib/glusterd from all the
nodes.
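In other words:

# gluster v get all cluster.max-op-version
# gluster v set all cluster.op-version <value>

and then, on the rejected peer (assuming glusterd is managed by systemd):

# systemctl restart glusterd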
Post by Jamie Lawrence
Post by Jamie Lawrence
I'm tempted to repeat - down things, copy the checksum the "good" ones
agree on, start things; but given that this has turned into a
balloon-squeezing exercise, I want to make sure I'm not doing this the
wrong way.
Post by Atin Mukherjee
Yes, that's the way. Copy /var/lib/glusterd/vols/<volname>/ from the
good node to the rejected one and restart the glusterd service on the
rejected peer.
My apologies for the multiple messages - I'm having to work on this episodically.
I've tried again to reset state on the bad peer, to no avail. This time I
downed all of the peers, copied things over (ensuring that the
tier-enabled line was absent), and started back up; the cksum immediately
changed to a bad value, the two good nodes added that line back in, and
the bad node didn't have it.
Just to have a clear view of this, I did it yet again, this time ensuring
the tier-enabled line was present everywhere. Same result, except that it
didn't add the tier-enabled line, which I suppose makes some sense.
# gluster v get all cluster.op-version
Option Value
------ -----
cluster.op-version 30800
op-version=30712
client-op-version=30712
I don't know what it means that the cluster is at one version but
apparently the volume is set for another - I thought that was a
cluster-level setting. (Client.op-version theoretically makes more sense -
I can see Ovirt wanting an older version.)
I'm at a loss to fix this - copying /var/lib/glusterd/vols/<volname> over
doesn't fix the problem. I'd be somewhat OK with trashing the volume and
starting over, if it weren't for two things: (1) Ovirt was also a massive
pain to set up, and it is configured on this volume; and (2), perhaps more
importantly, I'm concerned about this happening again once this is in
production, which would be Bad, especially if I don't have a fix.
So at this point, I'm unclear on how to move forward or even where more to
look for potential problems.
-j
- - - -
[2018-03-06 22:30:32.421530] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c
[2018-03-06 22:30:32.422582] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 3949237931, remote cksum = 2068896937 on peer sc5-gluster-10g-1.squaretrade.com
[2018-03-06 22:30:32.422774] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-10g-1.squaretrade.com (0), ret: 0, op_ret: -1
[2018-03-06 22:30:32.424621] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 77cdfbba-348c-43fe-ab3d-00621904ea9c, host: sc5-gluster-10g-1.squaretrade.com, port: 0
[2018-03-06 22:30:32.425563] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a, host: sc5-gluster-2.squaretrade.com, port: 0
[2018-03-06 22:30:32.426706] I [MSGID: 106163] [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30800
[2018-03-06 22:30:32.428075] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: c1877e0d-ccb2-401e-83a6-e4a680af683a
[2018-03-06 22:30:32.428325] E [MSGID: 106010] [glusterd-utils.c:3374:glusterd_compare_friend_volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 3949237931, remote cksum = 2068896937 on peer sc5-gluster-2.squaretrade.com
[2018-03-06 22:30:32.428468] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sc5-gluster-2.squaretrade.com (0), ret: 0, op_ret: -1
Jamie Lawrence
2018-03-07 22:58:32 UTC
Post by Atin Mukherjee
Please run 'gluster v get all cluster.max-op-version' and whatever value it throws up should be used to bump up the cluster.op-version (gluster v set all cluster.op-version <value>). With that, if you restart the rejected peer, I believe the problem should go away; if it doesn't, I'd need to investigate further once you pass along the glusterd and cmd_history log files and the content of /var/lib/glusterd from all the nodes.
Thanks so much - that worked.

I clearly need to catch up on my Gluster reading - I thought I understood op-version, but evidently don't.

Anyway, thanks again for putting up with me.

Cheers,

-j
