Discussion:
[Gluster-users] "Incorrect brick" errors
Toby Corkindale
2013-08-06 08:24:39 UTC
Hi,
I'm getting some confusing "Incorrect brick" errors when attempting to
remove OR replace a brick.

gluster> volume info condor

Volume Name: condor
Type: Replicate
Volume ID: 9fef3f76-525f-4bfe-9755-151e0d8279fd
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: mel-storage01:/srv/brick/condor
Brick2: mel-storage02:/srv/brick/condor

gluster> volume remove-brick condor replica 1
mel-storage02:/srv/brick/condor start
Incorrect brick mel-storage02:/srv/brick/condor for volume condor


If that is the incorrect brick, then what have I done wrong?


thanks,
Toby
Toby Corkindale
2013-08-07 01:44:42 UTC
Post by Toby Corkindale
Hi,
I'm getting some confusing "Incorrect brick" errors when attempting to
remove OR replace a brick.
gluster> volume info condor
Volume Name: condor
Type: Replicate
Volume ID: 9fef3f76-525f-4bfe-9755-151e0d8279fd
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Brick1: mel-storage01:/srv/brick/condor
Brick2: mel-storage02:/srv/brick/condor
gluster> volume remove-brick condor replica 1
mel-storage02:/srv/brick/condor start
Incorrect brick mel-storage02:/srv/brick/condor for volume condor
If that is the incorrect brick, then what have I done wrong?
Note that the log files don't seem to be of any use here; they just report:


E [glusterd-brick-ops.c:749:glusterd_handle_remove_brick] 0-: Incorrect
brick mel-storage02:/srv/brick/condor for volume condor
Toby Corkindale
2013-08-08 02:31:24 UTC
I never did manage to figure this out.
All attempts to replace-brick failed inexplicably; we could add-brick
but then still not remove-brick the old one, and the new bricks didn't
seem to be functioning properly anyway.

Eventually we just sucked it up and caused a couple of hours of downtime
across all production servers while we brought up a whole new gluster
cluster and moved everything to it.

That's been the final straw for us though -- we're going to ditch
Gluster across the company as soon as possible. It's too risky to keep
using it.
It's been unreliable and unpredictable, and if anything version 3.3 has
been worse than 3.2 for bugs. (And I have no faith at all that 3.4 is an
improvement.)

-Toby
Post by Toby Corkindale
Post by Toby Corkindale
Hi,
I'm getting some confusing "Incorrect brick" errors when attempting to
remove OR replace a brick.
gluster> volume info condor
Volume Name: condor
Type: Replicate
Volume ID: 9fef3f76-525f-4bfe-9755-151e0d8279fd
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Brick1: mel-storage01:/srv/brick/condor
Brick2: mel-storage02:/srv/brick/condor
gluster> volume remove-brick condor replica 1
mel-storage02:/srv/brick/condor start
Incorrect brick mel-storage02:/srv/brick/condor for volume condor
If that is the incorrect brick, then what have I done wrong?
E [glusterd-brick-ops.c:749:glusterd_handle_remove_brick] 0-: Incorrect
brick mel-storage02:/srv/brick/condor for volume condor
Krishnan Parthasarathi
2013-08-08 03:09:28 UTC
Hi Toby,

----- Original Message -----
Post by Toby Corkindale
Hi,
I'm getting some confusing "Incorrect brick" errors when attempting to
remove OR replace a brick.
gluster> volume info condor
Volume Name: condor
Type: Replicate
Volume ID: 9fef3f76-525f-4bfe-9755-151e0d8279fd
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Brick1: mel-storage01:/srv/brick/condor
Brick2: mel-storage02:/srv/brick/condor
gluster> volume remove-brick condor replica 1
mel-storage02:/srv/brick/condor start
Incorrect brick mel-storage02:/srv/brick/condor for volume condor
If that is the incorrect brick, then what have I done wrong?
I agree that the error message displayed is far from helpful. The reason your
attempt to remove a brick from a 1x2 replicate volume failed is that it is not
a 'legal' operation.

Here are some implicit rules, with some background, for determining whether a
remove-brick operation is allowed. Some may seem debatable, but that is how
things are today. We could refine them and evolve a better set of rules via
discussions on the mailing lists.

1) The remove-brick start variant is applicable *only* when you have a dht (distribute)
type volume. In 3.3, you can identify that from the output of "gluster volume info <VOLNAME>":
the "Type" field would display "Distribute-<something>". Additionally, even in a
Distribute type volume, which includes Distribute-Replicate, Distribute-Stripe and other
combinations, all the bricks belonging to a subvolume need to be removed in one go.
For example, let's assume a 2x2 volume V1 with bricks b1, b2, b3, b4, such that b1,b2
form one replica pair and b3,b4 form the other.
If you wanted to use the remove-brick start variant, say for scaling down the volume,
you would do the following:

#gluster volume remove-brick V1 b3 b4 start
#gluster volume remove-brick V1 b3 b4 status

Once the remove-brick operation is completed,
#gluster volume remove-brick V1 b3 b4 commit

This would leave volume V1 with bricks b1,b2.

In the above workflow, the data residing in b3,b4 is migrated to
b1,b2.
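
For comparison, on a volume where the start variant does apply, the volume info output
would look roughly like the following (the volume, server and brick names here are only
illustrative, not taken from your setup):

# gluster volume info V1

Volume Name: V1
Type: Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/srv/brick/b1
Brick2: server2:/srv/brick/b2
Brick3: server1:/srv/brick/b3
Brick4: server2:/srv/brick/b4

With replica 2, consecutive bricks in this list form the replica pairs, so Brick1/Brick2
is one pair and Brick3/Brick4 the other; a remove-brick ... start would have to name both
bricks of a pair.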

2) remove-brick (without the 'start' subcommand) can be used to reduce the replica count
down to 2 in a Distribute-Replicate type volume. As of today, remove-brick doesn't permit
reducing the replica count in a pure replicate volume, i.e. 1xN, where N >= 2.
Note: There is some activity around evolving the 'right' rule. See http://review.gluster.com/#/c/5364/
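
For example (again with purely illustrative names), suppose a 2x3 Distribute-Replicate
volume V2 with bricks b1..b6, where b1,b2,b3 form one replica set and b4,b5,b6 the other.
Reducing it to replica 2 would be done roughly as follows, removing one brick from each
replica set (the exact confirmation behaviour may differ between releases):

#gluster volume remove-brick V2 replica 2 b3 b6

That takes the volume from 2 x 3 = 6 to 2 x 2 = 4, which is allowed; going below replica 2,
or doing this on a plain 1xN replicate volume like yours, is what gets rejected today.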

The above rules were evolved with the thought that no legal command should allow the
user to shoot herself in the foot without a 'repair' path. Put differently, we disallow
commands that might lead to data loss without the user being fully aware of it.

Hope that helps,
krish
Post by Toby Corkindale
thanks,
Toby
_______________________________________________
Gluster-users mailing list
http://supercolony.gluster.org/mailman/listinfo/gluster-users
Toby Corkindale
2013-08-08 04:49:45 UTC
Post by Krishnan Parthasarathi
Hi Toby,
----- Original Message -----
Post by Toby Corkindale
Hi,
I'm getting some confusing "Incorrect brick" errors when attempting to
remove OR replace a brick.
gluster> volume info condor
Volume Name: condor
Type: Replicate
Volume ID: 9fef3f76-525f-4bfe-9755-151e0d8279fd
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Brick1: mel-storage01:/srv/brick/condor
Brick2: mel-storage02:/srv/brick/condor
gluster> volume remove-brick condor replica 1
mel-storage02:/srv/brick/condor start
Incorrect brick mel-storage02:/srv/brick/condor for volume condor
If that is the incorrect brick, then what have I done wrong?
I agree that the error message displayed is far from helpful. The reason your
attempt to remove a brick from a 1x2 replicate volume failed is that it is not
a 'legal' operation.
Here are some implicit rules, with some background, for determining whether a
remove-brick operation is allowed. Some may seem debatable, but that is how
things are today. We could refine them and evolve a better set of rules via
discussions on the mailing lists.
1) The remove-brick start variant is applicable *only* when you have a dht (distribute)
type volume. In 3.3, you can identify that from the output of "gluster volume info <VOLNAME>":
the "Type" field would display "Distribute-<something>". Additionally, even in a
Distribute type volume, which includes Distribute-Replicate, Distribute-Stripe and other
combinations, all the bricks belonging to a subvolume need to be removed in one go.
For example, let's assume a 2x2 volume V1 with bricks b1, b2, b3, b4, such that b1,b2
form one replica pair and b3,b4 form the other.
If you wanted to use the remove-brick start variant, say for scaling down the volume,
you would do the following:
#gluster volume remove-brick V1 b3 b4 start
#gluster volume remove-brick V1 b3 b4 status
Once the remove-brick operation is completed,
#gluster volume remove-brick V1 b3 b4 commit
This would leave volume V1 with bricks b1,b2.
In the above workflow, the data residing in b3,b4 is migrated to
b1,b2.
2) remove-brick (without the 'start' subcommand) can be used to reduce the replica count
down to 2 in a Distribute-Replicate type volume. As of today, remove-brick doesn't permit
reducing the replica count in a pure replicate volume, i.e. 1xN, where N >= 2.
Note: There is some activity around evolving the 'right' rule. See http://review.gluster.com/#/c/5364/
The above rules were evolved with the thought that no legal command should allow the
user to shoot herself in the foot without a 'repair' path. Put differently, we disallow
commands that might lead to data loss without the user being fully aware of it.
Hope that helps,
krish
Well, it's a bit of a moot point now, since we had to rebuild the
cluster anyway.

Note that we attempted to raise the replica level to 3 and THEN remove
the old brick, and that failed to work. We also tried using
replace-brick to swap the old brick out for the new one; that also
failed with the "Incorrect brick" error. (The replace-brick method was
actually the first approach we tried.)
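
For the record, the commands we tried were roughly along these lines (mel-storage03 here
is just a stand-in for the actual replacement server's name):

gluster> volume replace-brick condor mel-storage02:/srv/brick/condor mel-storage03:/srv/brick/condor start
gluster> volume add-brick condor replica 3 mel-storage03:/srv/brick/condor
gluster> volume remove-brick condor replica 2 mel-storage02:/srv/brick/condor

The add-brick went through, but replace-brick failed with the "Incorrect brick" error and
the remove-brick step failed as well.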

As such -- it seems there is no way to replace a failed server with a
new one if you're using the Replicated setup?


Toby
