Discussion:
How to remove dead peer, sorry urgent again :(
Lindsay Mathieson
2017-06-10 23:38:15 UTC
Permalink
Since my node died on Friday I have a dead peer (vna) that needs to be
removed.


I had major issues this morning that I haven't resolved yet, with all VMs
going offline when I rebooted a node, which I *hope* was due to quorum
issues, as I now have four peers in the cluster, one dead, three live.


Confidence level is not high.
--
Lindsay Mathieson
WK
2017-06-11 00:01:34 UTC
Permalink
Post by Lindsay Mathieson
Since my node died on Friday I have a dead peer (vna) that needs to be
removed.
I had major issues this morning that I haven't resolved yet, with all
VMs going offline when I rebooted a node, which I *hope* was due to
quorum issues, as I now have four peers in the cluster, one dead, three
live.
Let's see:

According to your previous note, you had vna, vnb and vng all replica 3
in a working cluster.

vna died so you had two 'good' nodes left. All was good.

You replaced vna with vnd but it is probably not fully healed yet cuz
you had 3.8T worth of chunks to copy.

So you had two good nodes (vnb and vng) working and you rebooted one of
them?

If so, then yes: based on my experience learning how to deal with failed
nodes, you would get a quorum lock under those circumstances UNLESS you
turned off the quorums prior to the reboot.
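For what it's worth, "turning off the quorums" could look something like
this; the volume name "gv0" is hypothetical and the options may vary by
version:

gluster volume set gv0 cluster.server-quorum-type none
gluster volume set gv0 cluster.quorum-type none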

Do you show any split brains?

As an aside, I wonder if a strategy would have been to first replace VNA
with an arbiter, get the metadata synced up for quorum purposes, AND then
turn the arbiter into a full node by catching up the chunks.

Is that even possible?
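If it is, I'd guess at something along these lines, untested, with a
made-up volume name and brick paths:

gluster volume remove-brick gv0 replica 2 vna:/gluster/brick force
gluster volume add-brick gv0 replica 3 arbiter 1 vnd:/gluster/arbiter

Turning the arbiter back into a full brick afterwards is the part I'm not
sure gluster supports.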

-bill
Lindsay Mathieson
2017-06-11 00:12:29 UTC
Permalink
Post by WK
You replaced vna with vnd but it is probably not fully healed yet cuz
you had 3.8T worth of chunks to copy.
No, the heal had completed. Finished about 9 hours before I rebooted.
Post by WK
So you had two good nodes (vnb and vng) working and you rebooted one
of them?
Three good nodes - vnb, vng, vnh and one dead - vna

from node vng:

***@vng:~# gluster peer status
Number of Peers: 3

Hostname: vna.proxmox.softlog
Uuid: de673495-8cb2-4328-ba00-0419357c03d7
State: Peer in Cluster (Disconnected)

Hostname: vnb.proxmox.softlog
Uuid: 43a1bf8c-3e69-4581-8e16-f2e1462cfc36
State: Peer in Cluster (Connected)

Hostname: vnh.proxmox.softlog
Uuid: 9eb54c33-7f79-4a75-bc2b-67111bf3eae7
State: Peer in Cluster (Connected)
--
Lindsay Mathieson
WK
2017-06-11 00:46:04 UTC
Permalink
Post by Lindsay Mathieson
Three good nodes - vnb, vng, vnh and one dead - vna
Number of Peers: 3
Hostname: vna.proxmox.softlog
Uuid: de673495-8cb2-4328-ba00-0419357c03d7
State: Peer in Cluster (Disconnected)
Hostname: vnb.proxmox.softlog
Uuid: 43a1bf8c-3e69-4581-8e16-f2e1462cfc36
State: Peer in Cluster (Connected)
Hostname: vnh.proxmox.softlog
Uuid: 9eb54c33-7f79-4a75-bc2b-67111bf3eae7
State: Peer in Cluster (Connected)
I thought you had removed vna as defective and then ADDED in vnh as the
replacement?

Why is vna still there?

-bill
Lindsay Mathieson
2017-06-11 00:54:33 UTC
Permalink
Post by WK
I thought you had removed vna as defective and then ADDED in vnh as
the replacement?
Why is vna still there?
Because I *can't* remove it. It died and was unable to be brought back
up. The gluster peer detach command only works with live servers - a
severe problem IMHO.
--
Lindsay Mathieson
W Kern
2017-06-11 06:24:20 UTC
Permalink
Post by Lindsay Mathieson
Post by WK
I thought you had removed vna as defective and then ADDED in vnh as
the replacement?
Why is vna still there?
Because I *can't* remove it. It died and was unable to be brought back
up. The gluster peer detach command only works with live servers - a
severe problem IMHO.
wow, yes that is problematic.

I wonder if replace-brick would have handled that.
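Something like the following, perhaps; the volume name and brick paths
are made up, and recent releases only accept the "commit force" form:

gluster volume replace-brick gv0 vna.proxmox.softlog:/gluster/brick \
    vnh.proxmox.softlog:/gluster/brick commit force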
Atin Mukherjee
2017-06-11 08:42:41 UTC
Permalink
Post by Lindsay Mathieson
Post by WK
I thought you had removed vna as defective and then ADDED in vnh as
the replacement?
Why is vna still there?
Because I *can't* remove it. It died and was unable to be brought back
up. The gluster peer detach command only works with live servers - a
severe problem IMHO.
If the dead server doesn't host any volumes (bricks of volumes, to be
specific), then you can actually remove the dead peer's uuid entry from
/var/lib/glusterd on the other nodes and restart the glusterd instances one
after another as a workaround. With Glusterd2, we will see if we can have a
better user experience here.
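For illustration, the entry is just a file named after the dead peer's
uuid, e.g.:

ls /var/lib/glusterd/peers/
# de673495-8cb2-4328-ba00-0419357c03d7 would be the file to remove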

GD2 team - your thoughts?
--
- Atin (atinm)
Lindsay Mathieson
2017-06-11 10:33:29 UTC
Permalink
Post by Atin Mukherjee
If the dead server doesn't host any volumes (bricks of volumes, to be
specific), then you can actually remove the dead peer's uuid entry from
/var/lib/glusterd on the other nodes and restart the glusterd instances
one after another as a workaround.
The server hosted a brick, but I removed that after it died with
"gluster v remove-brick force". Does that mean I could edit
/var/lib/glusterd as you suggest?
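For the record, the full command was along these lines (volume name and
brick path illustrative):

gluster volume remove-brick gv0 replica 2 vna.proxmox.softlog:/gluster/brick force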
Post by Atin Mukherjee
With Glusterd2, we will see if we can have a better user experience here.
That would be good - I imagine in a lot of cases the only way a server
is removed is *after* it has died.


Thanks.
--
Lindsay Mathieson
Atin Mukherjee
2017-06-11 10:36:28 UTC
Permalink
Post by Lindsay Mathieson
Post by Atin Mukherjee
If the dead server doesn't host any volumes (bricks of volumes, to be
specific), then you can actually remove the dead peer's uuid entry from
/var/lib/glusterd on the other nodes and restart the glusterd instances
one after another as a workaround.
The server hosted a brick, but I removed that after it died with
"gluster v remove-brick force". Does that mean I could edit
/var/lib/glusterd as you suggest?
Yes
Post by Lindsay Mathieson
Post by Atin Mukherjee
With Glusterd2, we will see if we can have a better user experience here.
That would be good - I imagine in a lot of cases the only way a server
is removed is *after* it has died.
Thanks.
--
--Atin
Lindsay Mathieson
2017-06-11 10:56:38 UTC
Permalink
Post by Atin Mukherjee
If the dead server doesn't host any volumes (bricks of volumes, to be
specific), then you can actually remove the dead peer's uuid entry from
/var/lib/glusterd on the other nodes
Is that just the file entry in "/var/lib/glusterd/peers"?


e.g. I have:

gluster peer status
Number of Peers: 3

Hostname: vnh.proxmox.softlog
Uuid: 9eb54c33-7f79-4a75-bc2b-67111bf3eae7
State: Peer in Cluster (Connected)

Hostname: vna.proxmox.softlog
Uuid: de673495-8cb2-4328-ba00-0419357c03d7
State: Peer in Cluster (Disconnected)
Hostname: vnb.proxmox.softlog
Uuid: 43a1bf8c-3e69-4581-8e16-f2e1462cfc36
State: Peer in Cluster (Connected)

Do I just:

rm /var/lib/glusterd/peers/de673495-8cb2-4328-ba00-0419357c03d7


On all the live nodes, and then restart glusterd? Nothing else?


thanks.
--
Lindsay Mathieson
Atin Mukherjee
2017-06-11 11:00:07 UTC
Permalink
Post by Atin Mukherjee
If the dead server doesn't host any volumes (bricks of volumes, to be
specific), then you can actually remove the dead peer's uuid entry from
/var/lib/glusterd on the other nodes
Is that just the file entry in "/var/lib/glusterd/peers"?
gluster peer status
Number of Peers: 3
Hostname: vnh.proxmox.softlog
Uuid: 9eb54c33-7f79-4a75-bc2b-67111bf3eae7
State: Peer in Cluster (Connected)
Hostname: vna.proxmox.softlog
Uuid: de673495-8cb2-4328-ba00-0419357c03d7
State: Peer in Cluster (Disconnected)
Hostname: vnb.proxmox.softlog
Uuid: 43a1bf8c-3e69-4581-8e16-f2e1462cfc36
State: Peer in Cluster (Connected)
rm /var/lib/glusterd/peers/de673495-8cb2-4328-ba00-0419357c03d7
Yes. And please ensure you do this after bringing down all the glusterd
instances; once the peer file is removed from all the nodes, restart
glusterd on all the nodes one after another.
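In other words, as a sketch (the service name varies by distro; on
Debian-based nodes it is glusterfs-server rather than glusterd):

# 1. on every node:
systemctl stop glusterd
# 2. on every node:
rm /var/lib/glusterd/peers/de673495-8cb2-4328-ba00-0419357c03d7
# 3. then, one node after another:
systemctl start glusterd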
Post by Atin Mukherjee
On all the live nodes, and then restart glusterd? Nothing else?
thanks.
--
- Atin (atinm)
Gandalf Corvotempesta
2017-06-11 11:05:40 UTC
Permalink
On 11 Jun 2017 at 1:00 PM, "Atin Mukherjee" <***@redhat.com> wrote:

Yes. And please ensure you do this after bringing down all the glusterd
instances; once the peer file is removed from all the nodes, restart
glusterd on all the nodes one after another.


If you have to bring down all the glusterd instances before the file
removal, you also bring down the whole gluster storage.
Atin Mukherjee
2017-06-11 11:23:07 UTC
Permalink
On Sun, 11 Jun 2017 at 16:35, Gandalf Corvotempesta <
Post by Atin Mukherjee
Yes. And please ensure you do this after bringing down all the glusterd
instances; once the peer file is removed from all the nodes, restart
glusterd on all the nodes one after another.
If you have to bring down all the glusterd instances before the file
removal, you also bring down the whole gluster storage.
That's not correct unless server-side quorum is enabled; the I/O path
should stay active even though the management plane is down. We could still
get this done one node after another, without bringing down all the glusterd
instances at one go, but I just wanted to ensure the workaround is safe and
clean.
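Whether server-side quorum is enabled can be checked per volume; "gv0"
here is illustrative:

gluster volume get gv0 cluster.server-quorum-type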

--
--Atin
Lindsay Mathieson
2017-06-11 11:44:49 UTC
Permalink
Post by Atin Mukherjee
That's not correct unless server-side quorum is enabled; the I/O path
should stay active even though the management plane is down. We could still
get this done one node after another, without bringing down all the glusterd
instances at one go, but I just wanted to ensure the workaround is safe and
clean.
Not quite sure of your wording here, but I:

* brought down all glusterd with "systemctl stop
glusterfs-server.service" on each node
* rm /var/lib/glusterd/peers/de673495-8cb2-4328-ba00-0419357c03d7 on
each node
* "systemctl start glusterfs-server.service" on each node


Several hundred shards needed to be healed after that, but all done now
with no split-brain. And:

***@vng:~# gluster peer status
Number of Peers: 2

Hostname: vnh.proxmox.softlog
Uuid: 9eb54c33-7f79-4a75-bc2b-67111bf3eae7
State: Peer in Cluster (Connected)

Hostname: vnb.proxmox.softlog
Uuid: 43a1bf8c-3e69-4581-8e16-f2e1462cfc36
State: Peer in Cluster (Connected)


Which is good. Not in a position to test quorum by rebooting a node
right now though :) but I'm going to assume it's all good; I'll probably
test next weekend.
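For the record, the heal and split-brain state can be watched with the
usual commands; the volume name is illustrative:

gluster volume heal gv0 info
gluster volume heal gv0 info split-brain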

Thanks for all the help, much appreciated.
--
Lindsay Mathieson
Pranith Kumar Karampuri
2017-06-12 16:56:33 UTC
Permalink
On Sun, 11 Jun 2017 at 06:25, Lindsay Mathieson <
Post by Lindsay Mathieson
Post by WK
I thought you had removed vna as defective and then ADDED in vnh as
the replacement?
Why is vna still there?
Because I *can't* remove it. It died and was unable to be brought back
up. The gluster peer detach command only works with live servers - a
severe problem IMHO.
If the dead server doesn't host any volumes (bricks of volumes, to be
specific), then you can actually remove the dead peer's uuid entry from
/var/lib/glusterd on the other nodes and restart the glusterd instances one
after another as a workaround. With Glusterd2, we will see if we can have a
better user experience here.
We can also do "gluster peer detach <hostname> force", right?
GD2 team - your thoughts?
--
Pranith
Lindsay Mathieson
2017-06-12 21:05:51 UTC
Permalink
Post by Pranith Kumar Karampuri
We can also do "gluster peer detach <hostname> force", right?
Tried that, didn't work - threw an error.
--
Lindsay Mathieson
Lindsay Mathieson
2017-06-13 01:09:31 UTC
Permalink
Post by Pranith Kumar Karampuri
We can also do "gluster peer detach <hostname> force", right?
Just to be sure, I set up a test 3-node VM gluster cluster :) then shut
down one of the nodes and tried to remove it.


***@gh1:~# gluster peer status
Number of Peers: 2

Hostname: gh2.brian.softlog
Uuid: b59c32a5-eb10-4630-b147-890a98d0e51d
State: Peer in Cluster (Connected)

Hostname: gh3.brian.softlog
Uuid: 825afc5c-ead6-4c83-97a0-fbc9d8e19e62
State: Peer in Cluster (Disconnected)


***@gh1:~# gluster peer detach gh3 force
peer detach: failed: gh3 is not part of cluster
--
Lindsay
Atin Mukherjee
2017-06-13 01:15:23 UTC
Permalink
Post by Lindsay Mathieson
Post by Pranith Kumar Karampuri
We can also do "gluster peer detach <hostname> force", right?
Just to be sure, I set up a test 3-node VM gluster cluster :) then shut
down one of the nodes and tried to remove it.
Number of Peers: 2
Hostname: gh2.brian.softlog
Uuid: b59c32a5-eb10-4630-b147-890a98d0e51d
State: Peer in Cluster (Connected)
Hostname: gh3.brian.softlog
Uuid: 825afc5c-ead6-4c83-97a0-fbc9d8e19e62
State: Peer in Cluster (Disconnected)
peer detach: failed: gh3 is not part of cluster
This looks like a bug in the error code as the error message is wrong. I'll
take a look at it and get back.
--
- Atin (atinm)
Lindsay Mathieson
2017-06-13 01:42:27 UTC
Permalink
Post by Atin Mukherjee
This looks like a bug in the error code as the error message is wrong.
I'll take a look at it and get back.
I had a thought (they do happen) and tried some further testing.

***@gh1:~# gluster peer status
Number of Peers: 2

Hostname: gh2.brian.softlog
Uuid: b59c32a5-eb10-4630-b147-890a98d0e51d
State: Peer in Cluster (Connected)

Hostname: gh3.brian.softlog
Uuid: 825afc5c-ead6-4c83-97a0-fbc9d8e19e62
State: Peer in Cluster (Disconnected)
***@gh1:~# gluster peer detach gh3
peer detach: failed: gh3 is not part of cluster
***@gh1:~# gluster peer detach gh3.brian.softlog
peer detach: success


Specifying the FQDN of the peer seems to do the trick, with no need for
force either.

NB: just using the short host name does work while the host is up.
--
Lindsay
Lindsay Mathieson
2017-06-11 00:57:42 UTC
Permalink
Post by Lindsay Mathieson
Since my node died on Friday I have a dead peer (vna) that needs to be
removed.
I had major issues this morning that I haven't resolved yet, with all
VMs going offline when I rebooted a node, which I *hope* was due to
quorum issues, as I now have four peers in the cluster, one dead, three
live.
Confidence level is not high.
It definitely appears to be quorum issues. Rebooting a node makes the
volume inaccessible. All is fine once it's back up.


I did a

gluster volume set all cluster.server-quorum-ratio 51%

And that has resolved my issue for now, as it allows two servers to form
a quorum.
--
Lindsay Mathieson
Lindsay Mathieson
2017-06-11 01:01:42 UTC
Permalink
Post by Lindsay Mathieson
I did a
gluster volume set all cluster.server-quorum-ratio 51%
And that has resolved my issue for now, as it allows two servers to form
a quorum.
Edit :)

Actually:

gluster volume set all cluster.server-quorum-ratio 50%

(With one peer permanently dead, rebooting a node leaves 2 of 4 peers up,
which is exactly 50%, so a 51% ratio would still have blocked quorum.)
--
Lindsay Mathieson
W Kern
2017-06-11 06:29:25 UTC
Permalink
Post by Lindsay Mathieson
Post by Lindsay Mathieson
I did a
gluster volume set all cluster.server-quorum-ratio 51%
And that has resolved my issue for now, as it allows two servers to form
a quorum.
Edit :)
Actually:
gluster volume set all cluster.server-quorum-ratio 50%
OK, good to know, but somebody still needs to let us know how to clean
up the "replaced/dead" peer.

I'm going to have to play with that scenario on my testbed when I get a
chance.

-bill