Discussion:
Phasing out replace-brick for data migration in favor of remove-brick.
Anand Avati
2013-09-27 07:35:51 UTC
Hello all,
DHT's remove-brick + rebalance has been enhanced in the last couple of
releases to be quite sophisticated. It can handle graceful decommissioning
of bricks, including open file descriptors and hard links.

This in a way is a feature overlap with replace-brick's data migration
functionality. Replace-brick's data migration is currently also used for
planned decommissioning of a brick.

Reasons to remove replace-brick (or why remove-brick is better):

- There are two methods of moving data. It is confusing for the users and
hard for developers to maintain.

- If the server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary, because
self-healing itself will recreate the data (replace-brick actually uses
self-heal internally).

- In a non-replicated config, if a server is getting replaced by a new one,
add-brick <new> + remove-brick <old> "start" achieves the same goal as
replace-brick <old> <new> "start" (see the sketch after this list).

- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data) whereas
add-brick <new> + remove-brick <old> is completely transparent.

- Replace-brick strictly requires a server with enough free space to hold
the data of the old brick, whereas remove-brick will evenly spread out the
data of the brick being removed amongst the remaining servers.

- Replace-brick code is complex and messy (the real reason :p).

- No clear reason why replace-brick's data migration is better in any way
than remove-brick's data migration.
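
As a rough sketch of that add-brick + remove-brick flow for a plain distribute
volume (the volume name and brick paths below are only placeholders, and the
status step is repeated until the migration completes):

gluster volume add-brick myvol newhost:/bricks/b1
gluster volume remove-brick myvol oldhost:/bricks/b1 start
gluster volume remove-brick myvol oldhost:/bricks/b1 status
gluster volume remove-brick myvol oldhost:/bricks/b1 commit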

I plan to send out patches to remove all traces of replace-brick data
migration code by 3.5 branch time.

NOTE that replace-brick command itself will still exist, and you can
replace one server with another in case a server dies. It is only the data
migration functionality being phased out.

Please do ask any questions / raise concerns at this stage :)

Avati
James
2013-09-27 08:56:10 UTC
Permalink
Post by Anand Avati
Hello all,
Hey,

Interesting timing for this post...
I've actually started working on automatic brick addition/removal. (I'm
planning to add this to puppet-gluster of course.) I was hoping you
could help out with the algorithm. I think it's a bit different if
there's no replace-brick command as you are proposing.

Here's the problem:
Given a logically optimal initial volume:

volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2

suppose I know that I want to add/remove bricks such that my new volume
(if I had created it new) looks like:

volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2
h5:/b2 h6:/b2

What is the optimal algorithm for determining the correct sequence of
transforms needed to accomplish this task? Obviously there are
some simpler corner cases, but I'd like to solve the general case.

The transforms are obviously things like running the add-brick {...} and
remove-brick {...} commands.

Obviously we have to take into account that it's better to add bricks
and rebalance before we remove bricks, rather than risk the file system while
a replica is missing. The algorithm should work for any replica N. We want
to make sure the new layout still replicates the data across
different servers. In many cases, this will require creating a circular
"chain" of bricks as illustrated at the bottom of this image:
http://joejulian.name/media/uploads/images/replica_expansion.png
for example. I'd like to optimize for safety first, and then time, I
imagine.
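
For example (if I have the chained idea right), such a layout over just three
hosts would be ordered something like:

volC: rep=2; h1:/b1 h2:/b2 h2:/b1 h3:/b2 h3:/b1 h1:/b2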

Many thanks in advance.

James

Some comments below, although I'm a bit tired so I hope I said it all
right.
Post by Anand Avati
DHT's remove-brick + rebalance has been enhanced in the last couple of
releases to be quite sophisticated. It can handle graceful decommissioning
of bricks, including open file descriptors and hard links.
Sweet
Post by Anand Avati
This in a way is a feature overlap with replace-brick's data migration
functionality. Replace-brick's data migration is currently also used for
planned decommissioning of a brick.
- There are two methods of moving data. It is confusing for the users and
hard for developers to maintain.
- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary, because
self-healing itself will recreate the data (replace-brick actually uses
self-heal internally)
- In a non-replicated config if a server is getting replaced by a new one,
add-brick <new> + remove-brick <old> "start" achieves the same goal as
replace-brick <old> <new> "start".
- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data) whereas
add-brick <new> + remove-brick <old> is completely transparent.
- Replace brick strictly requires a server with enough free space to hold
the data of the old brick, whereas remove-brick will evenly spread out the
data of the bring being removed amongst the remaining servers.
Can you talk more about the replica = N case (where N is 2 or 3)?
With remove-brick/add-brick you will need to add/remove N (replica count)
bricks at a time, right? With replace-brick, you could just swap out
one, right? Isn't that a missing feature if you remove replace-brick?
Post by Anand Avati
- Replace-brick code is complex and messy (the real reason :p).
- No clear reason why replace-brick's data migration is better in any way
to remove-brick's data migration.
I plan to send out patches to remove all traces of replace-brick data
migration code by 3.5 branch time.
NOTE that replace-brick command itself will still exist, and you can
replace on server with another in case a server dies. It is only the data
migration functionality being phased out.
Please do ask any questions / raise concerns at this stage :)
I heard with 3.4 you can somehow change the replica count when adding
new bricks... What's the full story here please?

Thanks!
James
Post by Anand Avati
Avati
Anand Avati
2013-09-30 05:41:56 UTC
Post by James
Post by Anand Avati
Hello all,
Hey,
Interesting timing for this post...
I've actually started working on automatic brick addition/removal. (I'm
planning to add this to puppet-gluster of course.) I was hoping you
could help out with the algorithm. I think it's a bit different if
there's no replace-brick command as you are proposing.
volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2
suppose I know that I want to add/remove bricks such that my new volume
volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2
h5:/b2 h6:/b2
What is the optimal algorithm for determining the correct sequence of
transforms that are needed to accomplish this task. Obviously there are
some simpler corner cases, but I'd like to solve the general case.
The transforms are obviously things like running the add-brick {...} and
remove-brick {...} commands.
Obviously we have to take into account that it's better to add bricks
and rebalance before we remove bricks and risk the file system if a
replica is missing. The algorithm should work for any replica N. We want
to make sure the new layout makes sense to replicate the data on
different servers. In many cases, this will require creating a circular
http://joejulian.name/media/uploads/images/replica_expansion.png
for example. I'd like to optimize for safety first, and then time, I
imagine.
Many thanks in advance.
I see what you are asking. First of all, when running a 2-replica volume
you almost always want to have an even number of servers, and
add servers in even numbers. Ideally the two "sides" of the replicas should
be placed in separate failure zones - separate racks with separate power
supplies or separate AZs in the cloud. Having an odd number of servers with
2 replicas is a very "odd" configuration. In all these years I am yet to
come across a customer who has a production cluster with 2 replicas and an
odd number of servers. And setting up replicas in such a chained manner
makes it hard to reason about availability, especially when you are trying to
recover from a disaster. Having clear and separate "pairs" is definitely
what is recommended.

That being said, nothing prevents one from setting up a chain like above as
long as you are comfortable with the complexity of the configuration. And
phasing out replace-brick in favor of add-brick/remove-brick does not make
the above configuration impossible either. Let's say you have a chained
configuration of N servers, with replica pairs formed between adjacent servers:

h(i):/b1 h((i+1) % N):/b2 | i := 0 -> N-1

Now you add the (N+1)th server.

Using replace-brick, what you have been doing thus far is:

1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was "part of a previous brick"
2. replace-brick h0:/b2 hN:/b2 start ... commit

In case you are doing an add-brick/remove-brick approach, you would now
instead do:

1. add-brick h(N-1):/b1a hN:/b2
2. add-brick hN:/b1 h0:/b2a
3. remove-brick h(N-1):/b1 h0:/b2 start ... commit
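
To make that concrete, a rough sketch for N=3 (hosts h0, h1, h2 with pairs
h0:/b1+h1:/b2, h1:/b1+h2:/b2, h2:/b1+h0:/b2), adding a new host h3; "myvol" is
only a placeholder volume name:

1. gluster volume add-brick myvol replica 2 h2:/b1a h3:/b2
2. gluster volume add-brick myvol replica 2 h3:/b1 h0:/b2a
3. gluster volume remove-brick myvol h2:/b1 h0:/b2 start
   gluster volume remove-brick myvol h2:/b1 h0:/b2 status   # repeat until complete
   gluster volume remove-brick myvol h2:/b1 h0:/b2 commit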

You will not be left with only 1 copy of a file at any point in the
process, and you achieve the same "end result" as you would with replace-brick.
As mentioned before, I once again request you to consider if you really
want to deal with the configuration complexity of having chained
replication, instead of just adding servers in pairs.

Please ask if there are any more questions or concerns.

Avati
Post by James
James
Some comments below, although I'm a bit tired so I hope I said it all
right.
Post by Anand Avati
DHT's remove-brick + rebalance has been enhanced in the last couple of
releases to be quite sophisticated. It can handle graceful
decommissioning
Post by Anand Avati
of bricks, including open file descriptors and hard links.
Sweet
Post by Anand Avati
This in a way is a feature overlap with replace-brick's data migration
functionality. Replace-brick's data migration is currently also used for
planned decommissioning of a brick.
- There are two methods of moving data. It is confusing for the users and
hard for developers to maintain.
- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary, because
self-healing itself will recreate the data (replace-brick actually uses
self-heal internally)
- In a non-replicated config if a server is getting replaced by a new
one,
Post by Anand Avati
add-brick <new> + remove-brick <old> "start" achieves the same goal as
replace-brick <old> <new> "start".
- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data) whereas
add-brick <new> + remove-brick <old> is completely transparent.
- Replace brick strictly requires a server with enough free space to hold
the data of the old brick, whereas remove-brick will evenly spread out
the
Post by Anand Avati
data of the bring being removed amongst the remaining servers.
Can you talk more about the replica = N case (where N is 2 or 3?)
With remove brick, add brick you will need add/remove N (replica count)
bricks at a time, right? With replace brick, you could just swap out
one, right? Isn't that a missing feature if you remove replace brick?
Post by Anand Avati
- Replace-brick code is complex and messy (the real reason :p).
- No clear reason why replace-brick's data migration is better in any way
to remove-brick's data migration.
I plan to send out patches to remove all traces of replace-brick data
migration code by 3.5 branch time.
NOTE that replace-brick command itself will still exist, and you can
replace on server with another in case a server dies. It is only the data
migration functionality being phased out.
Please do ask any questions / raise concerns at this stage :)
I heard with 3.4 you can somehow change the replica count when adding
new bricks... What's the full story here please?
Thanks!
James
Post by Anand Avati
Avati
_______________________________________________
Gluster-users mailing list
http://supercolony.gluster.org/mailman/listinfo/gluster-users
KueiHuan Chen
2013-10-03 15:57:18 UTC
Hi, Avati

In your chained configuration, how do you replace the whole of h1 without
replace-brick? Is there a better way than replace-brick in this
situation?

h0:/b1 h1:/b2 h1:/b1 h2:/b2 h2:/b1 h0:/b2 (A new h3 wants to replace the old h1.)

Thanks.
Best Regards,

KueiHuan-Chen
Synology Incorporated.
Email: ***@synology.com
Tel: +886-2-25521814 ext.827
Post by James
Post by Anand Avati
Hello all,
Hey,
Interesting timing for this post...
I've actually started working on automatic brick addition/removal. (I'm
planning to add this to puppet-gluster of course.) I was hoping you
could help out with the algorithm. I think it's a bit different if
there's no replace-brick command as you are proposing.
volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2
suppose I know that I want to add/remove bricks such that my new volume
volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2
h5:/b2 h6:/b2
What is the optimal algorithm for determining the correct sequence of
transforms that are needed to accomplish this task. Obviously there are
some simpler corner cases, but I'd like to solve the general case.
The transforms are obviously things like running the add-brick {...} and
remove-brick {...} commands.
Obviously we have to take into account that it's better to add bricks
and rebalance before we remove bricks and risk the file system if a
replica is missing. The algorithm should work for any replica N. We want
to make sure the new layout makes sense to replicate the data on
different servers. In many cases, this will require creating a circular
http://joejulian.name/media/uploads/images/replica_expansion.png
for example. I'd like to optimize for safety first, and then time, I
imagine.
Many thanks in advance.
I see what you are asking. First of all, when running a 2-replica volume you
almost pretty much always want to have an even number of servers, and add
servers in even numbers. Ideally the two "sides" of the replicas should be
placed in separate failures zones - separate racks with separate power
supplies or separate AZs in the cloud. Having an odd number of servers with
an 2 replicas is a very "odd" configuration. In all these years I am yet to
come across a customer who has a production cluster with 2 replicas and an
odd number of servers. And setting up replicas in such a chained manner
makes it hard to reason about availability, especially when you are trying
recover from a disaster. Having clear and separate "pairs" is definitely
what is recommended.
That being said, nothing prevents one from setting up a chain like above as
long as you are comfortable with the complexity of the configuration. And
phasing out replace-brick in favor of add-brick/remove-brick does not make
the above configuration impossible either. Let's say you have a chained
h(i):/b1 h((i+1) % N):/b2 | i := 0 -> N-1
Now you add N+1th server.
1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was "part of a previous brick"
2. replace-brick h0:/b2 hN:/b2 start ... commit
In case you are doing an add-brick/remove-brick approach, you would now
1. add-brick h(N-1):/b1a hN:/b2
2. add-brick hN:/b1 h0:/b2a
3. remove-brick h(N-1):/b1 h0:/b2 start ... commit
You will not be left with only 1 copy of a file at any point in the process,
and achieve the same "end result" as you were with replace-brick. As
mentioned before, I once again request you to consider if you really want to
deal with the configuration complexity of having chained replication,
instead of just adding servers in pairs.
Please ask if there are any more questions or concerns.
Avati
Post by James
James
Some comments below, although I'm a bit tired so I hope I said it all
right.
Post by Anand Avati
DHT's remove-brick + rebalance has been enhanced in the last couple of
releases to be quite sophisticated. It can handle graceful
decommissioning
of bricks, including open file descriptors and hard links.
Sweet
Post by Anand Avati
This in a way is a feature overlap with replace-brick's data migration
functionality. Replace-brick's data migration is currently also used for
planned decommissioning of a brick.
- There are two methods of moving data. It is confusing for the users and
hard for developers to maintain.
- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary, because
self-healing itself will recreate the data (replace-brick actually uses
self-heal internally)
- In a non-replicated config if a server is getting replaced by a new one,
add-brick <new> + remove-brick <old> "start" achieves the same goal as
replace-brick <old> <new> "start".
- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data) whereas
add-brick <new> + remove-brick <old> is completely transparent.
- Replace brick strictly requires a server with enough free space to hold
the data of the old brick, whereas remove-brick will evenly spread out the
data of the bring being removed amongst the remaining servers.
Can you talk more about the replica = N case (where N is 2 or 3?)
With remove brick, add brick you will need add/remove N (replica count)
bricks at a time, right? With replace brick, you could just swap out
one, right? Isn't that a missing feature if you remove replace brick?
Post by Anand Avati
- Replace-brick code is complex and messy (the real reason :p).
- No clear reason why replace-brick's data migration is better in any way
to remove-brick's data migration.
I plan to send out patches to remove all traces of replace-brick data
migration code by 3.5 branch time.
NOTE that replace-brick command itself will still exist, and you can
replace on server with another in case a server dies. It is only the data
migration functionality being phased out.
Please do ask any questions / raise concerns at this stage :)
I heard with 3.4 you can somehow change the replica count when adding
new bricks... What's the full story here please?
Thanks!
James
Post by Anand Avati
Avati
Anand Avati
2013-10-03 16:27:37 UTC
Post by KueiHuan Chen
Hi, Avati
In your chained configuration, how to replace whole h1 without
replace-brick ? Is there has a better way than replace brick in this
situation ?
h0:/b1 h1:/b2 h1:/b1 h2:/b2 h2:/b1 h0:/b2 (A new h3 want to replace old h1.)
You have a couple of options,

A)

replace-brick h1:/b1 h3:/b1
replace-brick h1:/b2 h3:/b2

and let self-heal bring the disks up to speed, or

B)

add-brick replica 2 h3:/b1 h2:/b2a
add-brick replica 2 h3:/b2 h0:/b1a

remove-brick h0:/b1 h1:/b2 start .. commit
remove-brick h2:/b2 h1:/b1 start .. commit
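
Spelled out in full (a rough sketch; "myvol" is a placeholder volume name, and
the status step is repeated until the migration completes):

gluster volume add-brick myvol replica 2 h3:/b1 h2:/b2a
gluster volume add-brick myvol replica 2 h3:/b2 h0:/b1a
gluster volume remove-brick myvol h0:/b1 h1:/b2 start
gluster volume remove-brick myvol h0:/b1 h1:/b2 status
gluster volume remove-brick myvol h0:/b1 h1:/b2 commit
gluster volume remove-brick myvol h2:/b2 h1:/b1 start
gluster volume remove-brick myvol h2:/b2 h1:/b1 status
gluster volume remove-brick myvol h2:/b2 h1:/b1 commit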

Let me know if you still have questions.

Avati
Post by KueiHuan Chen
Thanks.
Best Regards,
KueiHuan-Chen
Synology Incorporated.
Tel: +886-2-25521814 ext.827
Post by Anand Avati
Post by James
Post by Anand Avati
Hello all,
Hey,
Interesting timing for this post...
I've actually started working on automatic brick addition/removal. (I'm
planning to add this to puppet-gluster of course.) I was hoping you
could help out with the algorithm. I think it's a bit different if
there's no replace-brick command as you are proposing.
volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2
suppose I know that I want to add/remove bricks such that my new volume
volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2
h5:/b2 h6:/b2
What is the optimal algorithm for determining the correct sequence of
transforms that are needed to accomplish this task. Obviously there are
some simpler corner cases, but I'd like to solve the general case.
The transforms are obviously things like running the add-brick {...} and
remove-brick {...} commands.
Obviously we have to take into account that it's better to add bricks
and rebalance before we remove bricks and risk the file system if a
replica is missing. The algorithm should work for any replica N. We want
to make sure the new layout makes sense to replicate the data on
different servers. In many cases, this will require creating a circular
http://joejulian.name/media/uploads/images/replica_expansion.png
for example. I'd like to optimize for safety first, and then time, I
imagine.
Many thanks in advance.
I see what you are asking. First of all, when running a 2-replica volume
you
Post by Anand Avati
almost pretty much always want to have an even number of servers, and add
servers in even numbers. Ideally the two "sides" of the replicas should
be
Post by Anand Avati
placed in separate failures zones - separate racks with separate power
supplies or separate AZs in the cloud. Having an odd number of servers
with
Post by Anand Avati
an 2 replicas is a very "odd" configuration. In all these years I am yet
to
Post by Anand Avati
come across a customer who has a production cluster with 2 replicas and
an
Post by Anand Avati
odd number of servers. And setting up replicas in such a chained manner
makes it hard to reason about availability, especially when you are
trying
Post by Anand Avati
recover from a disaster. Having clear and separate "pairs" is definitely
what is recommended.
That being said, nothing prevents one from setting up a chain like above
as
Post by Anand Avati
long as you are comfortable with the complexity of the configuration. And
phasing out replace-brick in favor of add-brick/remove-brick does not
make
Post by Anand Avati
the above configuration impossible either. Let's say you have a chained
h(i):/b1 h((i+1) % N):/b2 | i := 0 -> N-1
Now you add N+1th server.
1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was "part of a previous
brick"
Post by Anand Avati
2. replace-brick h0:/b2 hN:/b2 start ... commit
In case you are doing an add-brick/remove-brick approach, you would now
1. add-brick h(N-1):/b1a hN:/b2
2. add-brick hN:/b1 h0:/b2a
3. remove-brick h(N-1):/b1 h0:/b2 start ... commit
You will not be left with only 1 copy of a file at any point in the
process,
Post by Anand Avati
and achieve the same "end result" as you were with replace-brick. As
mentioned before, I once again request you to consider if you really
want to
Post by Anand Avati
deal with the configuration complexity of having chained replication,
instead of just adding servers in pairs.
Please ask if there are any more questions or concerns.
Avati
Post by James
James
Some comments below, although I'm a bit tired so I hope I said it all
right.
Post by Anand Avati
DHT's remove-brick + rebalance has been enhanced in the last couple of
releases to be quite sophisticated. It can handle graceful decommissioning
of bricks, including open file descriptors and hard links.
Sweet
Post by Anand Avati
This in a way is a feature overlap with replace-brick's data migration
functionality. Replace-brick's data migration is currently also used
for
Post by Anand Avati
Post by James
Post by Anand Avati
planned decommissioning of a brick.
- There are two methods of moving data. It is confusing for the users and
hard for developers to maintain.
- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary, because
self-healing itself will recreate the data (replace-brick actually
uses
Post by Anand Avati
Post by James
Post by Anand Avati
self-heal internally)
- In a non-replicated config if a server is getting replaced by a new one,
add-brick <new> + remove-brick <old> "start" achieves the same goal as
replace-brick <old> <new> "start".
- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data) whereas
add-brick <new> + remove-brick <old> is completely transparent.
- Replace brick strictly requires a server with enough free space to hold
the data of the old brick, whereas remove-brick will evenly spread out the
data of the bring being removed amongst the remaining servers.
Can you talk more about the replica = N case (where N is 2 or 3?)
With remove brick, add brick you will need add/remove N (replica count)
bricks at a time, right? With replace brick, you could just swap out
one, right? Isn't that a missing feature if you remove replace brick?
Post by Anand Avati
- Replace-brick code is complex and messy (the real reason :p).
- No clear reason why replace-brick's data migration is better in any way
to remove-brick's data migration.
I plan to send out patches to remove all traces of replace-brick data
migration code by 3.5 branch time.
NOTE that replace-brick command itself will still exist, and you can
replace on server with another in case a server dies. It is only the data
migration functionality being phased out.
Please do ask any questions / raise concerns at this stage :)
I heard with 3.4 you can somehow change the replica count when adding
new bricks... What's the full story here please?
Thanks!
James
Post by Anand Avati
Avati
Anand Avati
2013-10-10 21:30:00 UTC
http://review.gluster.org/#/c/6031/ (patch to remove replace-brick data
migration) is slated for merge before 3.5. Review comments (on gerrit)
welcome.

Thanks,
Avati
Post by Anand Avati
Post by KueiHuan Chen
Hi, Avati
In your chained configuration, how to replace whole h1 without
replace-brick ? Is there has a better way than replace brick in this
situation ?
h0:/b1 h1:/b2 h1:/b1 h2:/b2 h2:/b1 h0:/b2 (A new h3 want to replace old h1.)
You have a couple of options,
A)
replace-brick h1:/b1 h3:/b1
replace-brick h1:/b2 h3:/b2
and let self-heal bring the disks up to speed, or
B)
add-brick replica 2 h3:/b1 h2:/b2a
add-brick replica 2 h3:/b2 h0:/b1a
remove-brick h0:/b1 h1:/b2 start .. commit
remove-brick h2:/b2 h1:/b1 start .. commit
Let me know if you still have questions.
Avati
Post by KueiHuan Chen
Thanks.
Best Regards,
KueiHuan-Chen
Synology Incorporated.
Tel: +886-2-25521814 ext.827
Post by Anand Avati
Post by James
Post by Anand Avati
Hello all,
Hey,
Interesting timing for this post...
I've actually started working on automatic brick addition/removal. (I'm
planning to add this to puppet-gluster of course.) I was hoping you
could help out with the algorithm. I think it's a bit different if
there's no replace-brick command as you are proposing.
volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2
suppose I know that I want to add/remove bricks such that my new volume
volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2
h5:/b2 h6:/b2
What is the optimal algorithm for determining the correct sequence of
transforms that are needed to accomplish this task. Obviously there are
some simpler corner cases, but I'd like to solve the general case.
The transforms are obviously things like running the add-brick {...}
and
Post by Anand Avati
Post by James
remove-brick {...} commands.
Obviously we have to take into account that it's better to add bricks
and rebalance before we remove bricks and risk the file system if a
replica is missing. The algorithm should work for any replica N. We
want
Post by Anand Avati
Post by James
to make sure the new layout makes sense to replicate the data on
different servers. In many cases, this will require creating a circular
http://joejulian.name/media/uploads/images/replica_expansion.png
for example. I'd like to optimize for safety first, and then time, I
imagine.
Many thanks in advance.
I see what you are asking. First of all, when running a 2-replica
volume you
Post by Anand Avati
almost pretty much always want to have an even number of servers, and
add
Post by Anand Avati
servers in even numbers. Ideally the two "sides" of the replicas should
be
Post by Anand Avati
placed in separate failures zones - separate racks with separate power
supplies or separate AZs in the cloud. Having an odd number of servers
with
Post by Anand Avati
an 2 replicas is a very "odd" configuration. In all these years I am
yet to
Post by Anand Avati
come across a customer who has a production cluster with 2 replicas and
an
Post by Anand Avati
odd number of servers. And setting up replicas in such a chained manner
makes it hard to reason about availability, especially when you are
trying
Post by Anand Avati
recover from a disaster. Having clear and separate "pairs" is definitely
what is recommended.
That being said, nothing prevents one from setting up a chain like
above as
Post by Anand Avati
long as you are comfortable with the complexity of the configuration.
And
Post by Anand Avati
phasing out replace-brick in favor of add-brick/remove-brick does not
make
Post by Anand Avati
the above configuration impossible either. Let's say you have a chained
h(i):/b1 h((i+1) % N):/b2 | i := 0 -> N-1
Now you add N+1th server.
1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was "part of a previous
brick"
Post by Anand Avati
2. replace-brick h0:/b2 hN:/b2 start ... commit
In case you are doing an add-brick/remove-brick approach, you would now
1. add-brick h(N-1):/b1a hN:/b2
2. add-brick hN:/b1 h0:/b2a
3. remove-brick h(N-1):/b1 h0:/b2 start ... commit
You will not be left with only 1 copy of a file at any point in the
process,
Post by Anand Avati
and achieve the same "end result" as you were with replace-brick. As
mentioned before, I once again request you to consider if you really
want to
Post by Anand Avati
deal with the configuration complexity of having chained replication,
instead of just adding servers in pairs.
Please ask if there are any more questions or concerns.
Avati
Post by James
James
Some comments below, although I'm a bit tired so I hope I said it all
right.
Post by Anand Avati
DHT's remove-brick + rebalance has been enhanced in the last couple
of
Post by Anand Avati
Post by James
Post by Anand Avati
releases to be quite sophisticated. It can handle graceful decommissioning
of bricks, including open file descriptors and hard links.
Sweet
Post by Anand Avati
This in a way is a feature overlap with replace-brick's data
migration
Post by Anand Avati
Post by James
Post by Anand Avati
functionality. Replace-brick's data migration is currently also used
for
Post by Anand Avati
Post by James
Post by Anand Avati
planned decommissioning of a brick.
- There are two methods of moving data. It is confusing for the users and
hard for developers to maintain.
- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary, because
self-healing itself will recreate the data (replace-brick actually
uses
Post by Anand Avati
Post by James
Post by Anand Avati
self-heal internally)
- In a non-replicated config if a server is getting replaced by a new one,
add-brick <new> + remove-brick <old> "start" achieves the same goal
as
Post by Anand Avati
Post by James
Post by Anand Avati
replace-brick <old> <new> "start".
- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data) whereas
add-brick <new> + remove-brick <old> is completely transparent.
- Replace brick strictly requires a server with enough free space to hold
the data of the old brick, whereas remove-brick will evenly spread
out
Post by Anand Avati
Post by James
Post by Anand Avati
the
data of the bring being removed amongst the remaining servers.
Can you talk more about the replica = N case (where N is 2 or 3?)
With remove brick, add brick you will need add/remove N (replica count)
bricks at a time, right? With replace brick, you could just swap out
one, right? Isn't that a missing feature if you remove replace brick?
Post by Anand Avati
- Replace-brick code is complex and messy (the real reason :p).
- No clear reason why replace-brick's data migration is better in any way
to remove-brick's data migration.
I plan to send out patches to remove all traces of replace-brick data
migration code by 3.5 branch time.
NOTE that replace-brick command itself will still exist, and you can
replace on server with another in case a server dies. It is only the data
migration functionality being phased out.
Please do ask any questions / raise concerns at this stage :)
I heard with 3.4 you can somehow change the replica count when adding
new bricks... What's the full story here please?
Thanks!
James
Post by Anand Avati
Avati
James
2013-10-11 04:20:20 UTC
Post by Anand Avati
I see what you are asking. First of all, when running a 2-replica volume
you almost pretty much always want to have an even number of servers, and
add servers in even numbers. Ideally the two "sides" of the replicas should
be placed in separate failures zones - separate racks with separate power
supplies or separate AZs in the cloud. Having an odd number of servers with
an 2 replicas is a very "odd" configuration. In all these years I am yet to
come across a customer who has a production cluster with 2 replicas and an
odd number of servers. And setting up replicas in such a chained manner
makes it hard to reason about availability, especially when you are trying
recover from a disaster. Having clear and separate "pairs" is
definitely
what is recommended.
Obviously I completely agree. In fact, I've written most of the code for
this scenario; however, I'm trying to build out my code to support the
general case.
Post by Anand Avati
That being said, nothing prevents one from setting up a chain like above as
long as you are comfortable with the complexity of the configuration. And
phasing out replace-brick in favor of add-brick/remove-brick does not make
the above configuration impossible either. Let's say you have a chained
h(i):/b1 h((i+1) % N):/b2 | i := 0 -> N-1
Perfect... So far, so good.
Post by Anand Avati
Now you add N+1th server.
This server will be "N" because we're zero-based in your example...
Post by Anand Avati
1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was "part of a previous brick"
Here is that server; we complete the chain from hN to h0. Let's change
the name of h0:/b2a to h0:/b2-tmp instead. The problem is that this
assumes we have room for a b2-tmp on h0!
Post by Anand Avati
2. replace-brick h0:/b2 hN:/b2 start ... commit
Here, if you meant h0:/b2a aka h0:/b2-tmp (instead of h0:/b2), doesn't
this break the chain? Since hN is now standalone with b1 and b2,
and not part of the chain. In fact, the b1 and b2 on hN are actually
replicas of each other, so this is a SPOF.
Post by Anand Avati
In case you are doing an add-brick/remove-brick approach, you would now
1. add-brick h(N-1):/b1a hN:/b2
2. add-brick hN:/b1 h0:/b2a
3. remove-brick h(N-1):/b1 h0:/b2 start ... commit
I think this algorithm works, although I'd have to test it :P
The one downside (which I actually have a workaround for) is that the
new bricks have to be named differently from the original ones. Is
there a way around this?
Post by Anand Avati
You will not be left with only 1 copy of a file at any point in the
process, and achieve the same "end result" as you were with
replace-brick.
As mentioned before, I once again request you to consider if you really
want to deal with the configuration complexity of having chained
replication, instead of just adding servers in pairs.
I am just trying to avoid corner cases in my code. Puppet won't work
well with those :P
Post by Anand Avati
Please ask if there are any more questions or concerns.
I have some follow up, but for the moment, I have another question to
add into this thread. It's the same idea really... Suppose you have a
set of sanely named and ordered hosts and bricks. Is there one (and only
one) logical ordering for them? I've decided that the answer is yes, and
I've written the algorithm for ordering them:

https://github.com/purpleidea/puppet-gluster/blob/master/lib/facter/gluster_bricks.rb#L77

Do you have any comments / objections ?

I've attached an easy standalone version of this code to run.
(brick_logic_ordering_wip.rb)

I also have a more complicated version of this code.
(brick_logic_ordering_v2_wip.rb)
This code does almost the same thing as the first version.
The difference is that this version supports a proposed "brick
nomenclature". (See below)

What does this all mean? My theory: if you can define a logical brick
and hostname naming convention, and you always use it, then for
every given list of bricks there should be only one logical
"ordering" (where an ordering is the linear order needed for a create
volume command).

Secondly, if you want to add or remove bricks, and you do so by
following the naming convention, then the combined old list + new bricks
can also be sorted in a single linear ordering. Furthermore, there
exists an algorithm that can compute the needed add/remove brick
commands to transform from the initial set to the second set.

I've attached this algorithm here:
(brick_logic_transform_v1_wip.rb)

The only other thing to mention is the brick nomenclature:
It is:
/path/bxxxxxxx#vzzzz

where b is a constant char 'b'
where xxxxxxx is a zero padded int for brick #
where #vzzzz is a constant '#v' followed by zzzz
where zzzz is a zero padded int for version #

Each time new bricks are added, you increment the max visible version #
and use that. If no version number is specified, then we assume version
1. The length of padding must be decided on in advance and can't be
changed.

valid brick names include:

/data/b000004

/data/b000022#v0003

and so on...
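
One nice side effect of the zero padding is that a plain lexicographic sort of
such brick paths already matches the numeric order, which is presumably what
makes a single canonical ordering possible. A quick shell check with the
example names above:

printf '%s\n' /data/b000022#v0003 /data/b000004 | sort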

Hostnames are simple: hostnameYYYY where YYYY is a padded int, and you
distribute your hosts sequentially across racks or switches or whatever
your commonality for SPOF is.

Technically, for the transforms, I'm not even sure the version # is
necessary.

The big problem with my algorithms is that they don't work for chained
configurations. I'd love to be able to make them work!

Why is all this relevant? Because if I can solve these problems,
Gluster users can have fully decentralized elastic volumes that
grow/shrink on demand, without ever having to manually run add/remove
brick commands. I'll be able to do all of this with puppet-gluster, for
example. Users will just run puppet, without changing any
configurations, and hosts will automatically come up and grow to the
size the hardware supports. Most of the code is already published. More
to come.

Hope that was all understandable. It's probably hard to talk about this
by email, but I'm trying. :)

Cheers,
James
Post by Anand Avati
Avati
Amar Tumballi
2013-09-30 09:26:22 UTC
Inline response.
Post by James
Post by Anand Avati
Hello all,
Hey,
Interesting timing for this post...
I've actually started working on automatic brick addition/removal. (I'm
planning to add this to puppet-gluster of course.) I was hoping you
could help out with the algorithm. I think it's a bit different if
there's no replace-brick command as you are proposing.
volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2
suppose I know that I want to add/remove bricks such that my new volume
volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2
h5:/b2 h6:/b2
What is the optimal algorithm for determining the correct sequence of
transforms that are needed to accomplish this task. Obviously there are
some simpler corner cases, but I'd like to solve the general case.
The transforms are obviously things like running the add-brick {...} and
remove-brick {...} commands.
This is the exact reason why our best practice recommends exporting a
directory inside a mountpoint as the brick, in this case
h1:/b1/d1 (where d1 is a directory inside the mountpoint /b1).

This lets you later have a brick h1:/b1/d2, which is technically the same
thing you would like to have in volB.
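
For instance (a trimmed-down sketch, with a made-up volume name), the initial
volume would be created against those directories rather than the mountpoints:

gluster volume create myvol replica 2 h1:/b1/d1 h2:/b1/d1 h1:/b2/d1 h2:/b2/d1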

Also, it is never good to swap/change/move replica pairs to different
sets; that would lead to many issues, like duplicate files, etc.
Post by James
Post by Anand Avati
- Replace brick strictly requires a server with enough free space to hold
the data of the old brick, whereas remove-brick will evenly spread out the
data of the bring being removed amongst the remaining servers.
Can you talk more about the replica = N case (where N is 2 or 3?)
With remove brick, add brick you will need add/remove N (replica count)
bricks at a time, right? With replace brick, you could just swap out
one, right? Isn't that a missing feature if you remove replace brick?
For that particular swap without data migration, you will still have
'replace-brick'. What it does is replace an existing brick of a
replica pair with an empty brick, so that replicate's self-heal daemon
populates the data in it.
Post by James
Post by Anand Avati
Please do ask any questions / raise concerns at this stage :)
I heard with 3.4 you can somehow change the replica count when adding
new bricks... What's the full story here please?
Yes, CLI support for this has existed since glusterfs-3.3.x
(http://review.gluster.com/158) itself; it's just that there are a few bugs.

The syntax of add-brick is:

gluster volume add-brick <VOLNAME> [<stripe|replica> <COUNT>]
<NEW-BRICK> ... [force] - add brick to volume <VOLNAME>

You change the count by giving 'replica N', where N is the already existing
replica count -1/+1.
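
For example (a rough sketch with a made-up volume name): on a volume that is a
single replica-2 pair, something like

gluster volume add-brick myvol replica 3 h3:/b1

should turn it into a 1x3 volume, with the new brick populated by self-heal.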

Regards,
Amar
Amar Tumballi
2013-09-27 17:15:52 UTC
Post by Anand Avati
Hello all,
DHT's remove-brick + rebalance has been enhanced in the last couple of
releases to be quite sophisticated. It can handle graceful decommissioning
of bricks, including open file descriptors and hard links.
The last set of patches for this should be reviewed and accepted before we make
that claim :-) [ http://review.gluster.org/5891 ]
Post by Anand Avati
This in a way is a feature overlap with replace-brick's data migration
functionality. Replace-brick's data migration is currently also used for
planned decommissioning of a brick.
- There are two methods of moving data. It is confusing for the users and
hard for developers to maintain.
- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary, because
self-healing itself will recreate the data (replace-brick actually uses
self-heal internally)
- In a non-replicated config if a server is getting replaced by a new one,
add-brick <new> + remove-brick <old> "start" achieves the same goal as
replace-brick <old> <new> "start".
Should we phase out the CLI usage of 'remove-brick' without any option too?
Because even if users do it by mistake, they would lose data. We should
enforce 'start' and then 'commit' usage of remove-brick. Also, if the old
behaviour is required for anyone, they anyway have the 'force' option.
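
In other words (a rough sketch, volume and brick names made up), the enforced
flow would be:

gluster volume remove-brick myvol h1:/b1 start
gluster volume remove-brick myvol h1:/b1 status   # until migration completes
gluster volume remove-brick myvol h1:/b1 commit

with 'gluster volume remove-brick myvol h1:/b1 force' kept for the old
remove-without-migration behaviour.
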
Post by Anand Avati
- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data) whereas
add-brick <new> + remove-brick <old> is completely transparent.
+10 (that's the number of bugs open on these things :-)
Post by Anand Avati
- Replace brick strictly requires a server with enough free space to hold
the data of the old brick, whereas remove-brick will evenly spread out the
data of the bring being removed amongst the remaining servers.
- Replace-brick code is complex and messy (the real reason :p).
I wanted to see this reason as the 1st point, but it's OK as long as we mention
it. I too agree that it's _hard_ to maintain that piece of code.
Post by Anand Avati
- No clear reason why replace-brick's data migration is better in any way
to remove-brick's data migration.
One reason I heard when I sent the mail on gluster-devel earlier (
http://lists.nongnu.org/archive/html/gluster-devel/2012-10/msg00050.html )
was that the remove-brick way was a bit slower than replace-brick.
The technical reason is that remove-brick does DHT's readdir, whereas
replace-brick does the brick-level readdir.
Post by Anand Avati
I plan to send out patches to remove all traces of replace-brick data
migration code by 3.5 branch time.
Thanks for the initiative, let me know if you need help.
NOTE that replace-brick command itself will still exist, and you can
replace on server with another in case a server dies. It is only the data
migration functionality being phased out.
Yes, we need to be careful about this. We would need 'replace-brick' to
phase out a dead brick. The other day, there was some discussion on having
'gluster peer replace <old-peer> <new-peer>', which would re-write all the
volfiles properly. But that's mostly for the 3.6 time frame IMO.
Post by Anand Avati
Please do ask any questions / raise concerns at this stage :)
What is the window before you start sending out patches? I see
http://review.gluster.org/6010, which I guess is not totally complete
without phasing out the pump xlator :-)

I personally am all in for this change, as it helps me finish a few more
enhancements I am working on, like the 'discover()' changes, etc.

Regards,
Amar
Cool
2013-09-27 18:33:12 UTC
How does the new command set achieve this?

old layout (2x2):
rep=2: h1:/b1 h2:/b1 h1:/b2 h2:/b2

new layout (3x2):
rep=2: h1:/b1 h2:/b1 h1:/b2 h3:/b1 h2:/b2 h3:/b2

The purpose of the new layout is to make sure there is no SPOF, as I cannot
simply add h3:/b1 and h3:/b2 as a pair.

With replace-brick it's pretty straightforward, but without that ...
should I remove-brick h2:/b2 then add-brick h3:/b1? This means I'm going
to have only one copy of some data for a certain period of time, which
makes me feel nervous. Or should I add-brick h3:/b1 first? That doesn't
seem reasonable either.

Or am I the only one hitting this kind of upgrade?

-C.B.
Post by Anand Avati
Hello all,
DHT's remove-brick + rebalance has been enhanced in the last
couple of releases to be quite sophisticated. It can handle
graceful decommissioning of bricks, including open file
descriptors and hard links.
Last set of patches for this should be reviewed and accepted before we
make that claim :-) [ http://review.gluster.org/5891 ]
This in a way is a feature overlap with replace-brick's data
migration functionality. Replace-brick's data migration is
currently also used for planned decommissioning of a brick.
- There are two methods of moving data. It is confusing for the
users and hard for developers to maintain.
- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary,
because self-healing itself will recreate the data (replace-brick
actually uses self-heal internally)
- In a non-replicated config if a server is getting replaced by a
new one, add-brick <new> + remove-brick <old> "start" achieves the
same goal as replace-brick <old> <new> "start".
Should we phase out CLI of doing a 'remove-brick' without any option
too? because even if users do it by mistake, they would loose data. We
should enforce 'start' and then 'commit' usage of remove-brick. Also
if old method is required for anyone, they anyways have 'force' option.
- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data) whereas
add-brick <new> + remove-brick <old> is completely transparent.
+10 (thats the number of bugs open on these things :-)
- Replace brick strictly requires a server with enough free space
to hold the data of the old brick, whereas remove-brick will
evenly spread out the data of the bring being removed amongst the
remaining servers.
- Replace-brick code is complex and messy (the real reason :p).
Wanted to see this reason as 1st point, but its ok as long as we
mention about this. I too agree that its _hard_ to maintain that piece
of code.
- No clear reason why replace-brick's data migration is better in
any way to remove-brick's data migration.
One reason I heard when I sent the mail on gluster-devel earlier
(http://lists.nongnu.org/archive/html/gluster-devel/2012-10/msg00050.html
) was that the remove-brick way was bit slower than that of
replace-brick. Technical reason being remove-brick does DHT's readdir,
where as replace-brick does the brick level readdir.
I plan to send out patches to remove all traces of replace-brick
data migration code by 3.5 branch time.
Thanks for the initiative, let me know if you need help.
NOTE that replace-brick command itself will still exist, and you
can replace on server with another in case a server dies. It is
only the data migration functionality being phased out.
Yes, we need to be careful about this. We would need 'replace-brick'
to phase out a dead brick. The other day, there was some discussion on
have 'gluster peer replace <old-peer> <new-peer>' which would re-write
all the vol files properly. But thats mostly for 3.6 time frame IMO.
Please do ask any questions / raise concerns at this stage :)
What is the window before you start sending out patches ?? I see
http://review.gluster.org/6010 which I guess is not totally complete
without phasing out pump xlator :-)
I personally am all in for this change, as it helps me to finish few
more enhancements I am working on like 'discover()' changes etc...
Regards,
Amar
Amar Tumballi
2013-09-30 09:46:37 UTC
Post by Cool
How does the new command set achieve this?
rep=2: h1:/b1 h2:/b1 h1:/b2 h2:/b2
rep=2: h1:/b1 h2:/b1 h1:/b2 h3:/b1 h2:/b2 h3:/b2
purpose for the new layout is to make sure there is no SOF, as I
cannot simple add h3:/b1 and h3:/b2 as a pair.
With replace-brick it pretty straightforward, but without that ...
should I remove-brick h2:/b2 then add-brick h3:/b1? this means I'm
going to have only one copy for some data for a certain period of
time, which makes me feel nervous. Or, should I add-brick h3:/b1
first? That doesn't seems to be reasonable either.
Or am I the only one hitting this kind of upgrade?
No, you are not the only one. This is the exact reason we recommend adding
nodes in multiples of 2.

Also, another recommendation is to export directories as bricks, rather than
the mountpoints themselves.

In your case, it would be (following the above best practice):

# gluster volume info test-vol:
rep=2: h1:/b1/d1 h2:/b1/d1 h1:/b2/d1 h2:/b2/d1

# gluster volume add-brick test-vol h1:/b2/d2 h3:/b1/d1 h2:/b2/d2 h3:/b2/d1
# gluster volume remove-brick test-vol h1:/b2/d1 h2:/b2/d1 start

# gluster volume remove-brick test-vol h1:/b2/d1 h2:/b2/d1 commit

# gluster volume info test-vol:
rep=2: h1:/b1/d1 h2:/b1/d1 h1:/b2/d2 h3:/b1/d1 h2:/b2/d2 h3:/b2/d1
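
Presumably you would also run

# gluster volume remove-brick test-vol h1:/b2/d1 h2:/b2/d1 status

between the start and commit steps, and only commit once the migration reports
completed.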

Hope this works.

Regards,
Amar
Post by Cool
-C.B.
Post by Anand Avati
Hello all,
DHT's remove-brick + rebalance has been enhanced in the last
couple of releases to be quite sophisticated. It can handle
graceful decommissioning of bricks, including open file
descriptors and hard links.
Last set of patches for this should be reviewed and accepted before
we make that claim :-) [ http://review.gluster.org/5891 ]
This in a way is a feature overlap with replace-brick's data
migration functionality. Replace-brick's data migration is
currently also used for planned decommissioning of a brick.
- There are two methods of moving data. It is confusing for the
users and hard for developers to maintain.
- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary,
because self-healing itself will recreate the data (replace-brick
actually uses self-heal internally)
- In a non-replicated config if a server is getting replaced by a
new one, add-brick <new> + remove-brick <old> "start" achieves
the same goal as replace-brick <old> <new> "start".
Should we phase out CLI of doing a 'remove-brick' without any option
too? because even if users do it by mistake, they would loose data.
We should enforce 'start' and then 'commit' usage of remove-brick.
Also if old method is required for anyone, they anyways have 'force'
option.
- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data)
whereas add-brick <new> + remove-brick <old> is completely
transparent.
+10 (thats the number of bugs open on these things :-)
- Replace brick strictly requires a server with enough free space
to hold the data of the old brick, whereas remove-brick will
evenly spread out the data of the bring being removed amongst the
remaining servers.
- Replace-brick code is complex and messy (the real reason :p).
Wanted to see this reason as 1st point, but its ok as long as we
mention about this. I too agree that its _hard_ to maintain that
piece of code.
- No clear reason why replace-brick's data migration is better in
any way to remove-brick's data migration.
One reason I heard when I sent the mail on gluster-devel earlier
(http://lists.nongnu.org/archive/html/gluster-devel/2012-10/msg00050.html
) was that the remove-brick way was bit slower than that of
replace-brick. Technical reason being remove-brick does DHT's
readdir, where as replace-brick does the brick level readdir.
I plan to send out patches to remove all traces of replace-brick
data migration code by 3.5 branch time.
Thanks for the initiative, let me know if you need help.
NOTE that replace-brick command itself will still exist, and you
can replace on server with another in case a server dies. It is
only the data migration functionality being phased out.
Yes, we need to be careful about this. We would need 'replace-brick'
to phase out a dead brick. The other day, there was some discussion
on have 'gluster peer replace <old-peer> <new-peer>' which would
re-write all the vol files properly. But thats mostly for 3.6 time
frame IMO.
Please do ask any questions / raise concerns at this stage :)
What is the window before you start sending out patches ?? I see
http://review.gluster.org/6010 which I guess is not totally complete
without phasing out pump xlator :-)
I personally am all in for this change, as it helps me to finish few
more enhancements I am working on like 'discover()' changes etc...
Regards,
Amar
Cool
2013-09-30 17:34:06 UTC
Nice, thanks for the clarification.

-C.B.
Post by Amar Tumballi
Post by Cool
How does the new command set achieve this?
rep=2: h1:/b1 h2:/b1 h1:/b2 h2:/b2
rep=2: h1:/b1 h2:/b1 h1:/b2 h3:/b1 h2:/b2 h3:/b2
purpose for the new layout is to make sure there is no SOF, as I
cannot simple add h3:/b1 and h3:/b2 as a pair.
With replace-brick it pretty straightforward, but without that ...
should I remove-brick h2:/b2 then add-brick h3:/b1? this means I'm
going to have only one copy for some data for a certain period of
time, which makes me feel nervous. Or, should I add-brick h3:/b1
first? That doesn't seems to be reasonable either.
Or am I the only one hitting this kind of upgrade?
No, you are not only one. This is the exact reason, we recommend
adding nodes in multiple of 2s.
Also, another recommendation is having directories exported and not
the mountpoint itself for bricks.
In your case, it would be (by following above best practice)
rep=2: h1:/b1/d1 h2:/b1/d1 h1:/b2/d1 h2:/b2/d1
# gluster volume add-brick test-vol h1:/b2/d2 h3:/b1/d1 h2:/b2/d2 h3:/b2/d1
# gluster volume remove-brick test-vol h1:/b2/d1 h2:/b2/d1 start
# gluster volume remove-brick test-vol h1:/b2/d1 h2:/b2/d1 commit
rep=2: h1:/b1/d1 h2:/b1/d1 h1:/b2/d2 h3:/b1/d1 h2:/b2/d2 h3:/b2/d1
Hope this works.
Regards,
Amar
Post by Cool
-C.B.
Post by Anand Avati
Hello all,
DHT's remove-brick + rebalance has been enhanced in the last
couple of releases to be quite sophisticated. It can handle
graceful decommissioning of bricks, including open file
descriptors and hard links.
Last set of patches for this should be reviewed and accepted before
we make that claim :-) [ http://review.gluster.org/5891 ]
This in a way is a feature overlap with replace-brick's data
migration functionality. Replace-brick's data migration is
currently also used for planned decommissioning of a brick.
- There are two methods of moving data. It is confusing for the
users and hard for developers to maintain.
- If server being replaced is a member of a replica set, neither
remove-brick nor replace-brick data migration is necessary,
because self-healing itself will recreate the data (replace-brick
actually uses self-heal internally)
- In a non-replicated config if a server is getting replaced by a
new one, add-brick <new> + remove-brick <old> "start" achieves
the same goal as replace-brick <old> <new> "start".
Should we phase out CLI of doing a 'remove-brick' without any option
too? because even if users do it by mistake, they would loose data.
We should enforce 'start' and then 'commit' usage of remove-brick.
Also if old method is required for anyone, they anyways have 'force'
option.
- In a non-replicated config, <replace-brick> is NOT glitch free
(applications witness ENOTCONN if they are accessing data)
whereas add-brick <new> + remove-brick <old> is completely
transparent.
+10 (thats the number of bugs open on these things :-)
- Replace brick strictly requires a server with enough free space
to hold the data of the old brick, whereas remove-brick will
evenly spread out the data of the bring being removed amongst the
remaining servers.
- Replace-brick code is complex and messy (the real reason :p).
Wanted to see this reason as 1st point, but its ok as long as we
mention about this. I too agree that its _hard_ to maintain that
piece of code.
- No clear reason why replace-brick's data migration is better in
any way to remove-brick's data migration.
One reason I heard when I sent the mail on gluster-devel earlier
(http://lists.nongnu.org/archive/html/gluster-devel/2012-10/msg00050.html
) was that the remove-brick way was bit slower than that of
replace-brick. Technical reason being remove-brick does DHT's
readdir, where as replace-brick does the brick level readdir.
I plan to send out patches to remove all traces of replace-brick
data migration code by 3.5 branch time.
Thanks for the initiative, let me know if you need help.
NOTE that replace-brick command itself will still exist, and you
can replace on server with another in case a server dies. It is
only the data migration functionality being phased out.
Yes, we need to be careful about this. We would need 'replace-brick'
to phase out a dead brick. The other day, there was some discussion
on have 'gluster peer replace <old-peer> <new-peer>' which would
re-write all the vol files properly. But thats mostly for 3.6 time
frame IMO.
Please do ask any questions / raise concerns at this stage :)
What is the window before you start sending out patches ?? I see
http://review.gluster.org/6010 which I guess is not totally complete
without phasing out pump xlator :-)
I personally am all in for this change, as it helps me to finish few
more enhancements I am working on like 'discover()' changes etc...
Regards,
Amar
Anand Avati
2013-09-30 22:13:18 UTC
Post by Anand Avati
I plan to send out patches to remove all traces of replace-brick data
Post by Anand Avati
migration code by 3.5 branch time.
Thanks for the initiative, let me know if you need help.
I could use help here - do you have free cycles to pick up this task?

Avati
Amar Tumballi
2013-10-03 12:39:02 UTC
Post by Anand Avati
Post by Anand Avati
I plan to send out patches to remove all traces of replace-brick data
Post by Anand Avati
migration code by 3.5 branch time.
Thanks for the initiative, let me know if you need help.
I could use help here, if you have free cycles to pick up this task?
Sure!
Cleanup in CLI/glusterd - http://review.gluster.org/6031