Discussion:
Gluster-users Digest, Vol 20, Issue 22
Larry Bates
2009-12-17 16:17:30 UTC
Phil,

I think the real question you need to ask has to do with why we are
using GlusterFS at all and what happens when something fails. Normally
GlusterFS is used to provide scalability, redundancy/recovery, and
performance. For many applications performance will be the least of the
worries, so we concentrate on scalability and redundancy/recovery.
Scalability can be achieved no matter which way you configure your
servers. Using the distribute translator (DHT) you can unify all the
servers into a single virtual storage space. The problem comes when you
look at what happens when you have a machine or drive failure and need the
redundancy/recovery capabilities of GlusterFS. By putting 36TB of
storage on a single server and exposing it as a single volume (using
either hardware or software RAID), you will have to replicate all of it to a
replacement server after a failure. Replicating 36TB will take a lot of
time and CPU cycles. If you keep things simple (JBOD), use AFR to
replicate drives between servers, and use DHT to unify everything
together, you only have to move 1.5TB/2TB when a drive fails. You
will also note that you get to use 100% of your disk storage this way,
instead of giving up one drive per array with RAID5 or two drives with
RAID6. Normally with RAID5/6 it is also imperative that you have a hot
spare per array, which means you give up an additional drive per array.
To make RAID5/6 work with no single point of failure you have to do
something like RAID50/60 across two controllers, which gets expensive and
much more difficult to manage and to grow. Implementing GlusterFS on
more modest hardware makes all those "issues" go away. Just use
GlusterFS to provide the RAID-like capabilities (via AFR and DHT).
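As a rough illustration, here is a minimal client-side volfile sketch in
the GlusterFS 2.x/3.0 style (hostnames and brick names are hypothetical):
each drive is exported as a brick, pairs of bricks are mirrored with
cluster/replicate (AFR), and the mirrored pairs are unified with
cluster/distribute (DHT):

volume server1-brick1
  type protocol/client
  option transport-type tcp
  option remote-host server1      # hypothetical hostname
  option remote-subvolume brick1  # one exported drive
end-volume

volume server2-brick1
  type protocol/client
  option transport-type tcp
  option remote-host server2
  option remote-subvolume brick1
end-volume

volume afr0
  type cluster/replicate          # AFR: mirror the pair across servers
  subvolumes server1-brick1 server2-brick1
end-volume

# ... define afr1, afr2, ... the same way over the remaining drive pairs ...

volume dht
  type cluster/distribute         # DHT: unify all mirrored pairs
  subvolumes afr0 afr1 afr2
end-volume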

Personally, I doubt that I would set up my storage the way you describe.
I probably would (and have) set it up with more, smaller servers:
something like three times as many 2U servers with 8x2TB drives each (or
even six times as many 1U servers with 4x2TB drives each), and forget the
expensive SATA RAID controllers; they aren't necessary and are just a
single point of failure that you can eliminate. In addition you will
enjoy significant performance improvements because you have:

1) Many parallel paths to storage (36x1U or 18x2U vs 6x5U servers).
Gigabit Ethernet is fast, but it will still limit bandwidth to any single
machine.
2) Write performance on RAID5/6 is never going to be as fast as JBOD.
3) You should have much more memory available for caching (36x8GB = 288GB
or 18x8GB = 144GB vs maybe 6x16GB = 96GB).
4) Management of the storage is done in one place: GlusterFS. No messy
RAID controller setups to document/remember.
5) You can expand in the future in a much more granular and controlled
fashion. Add 2 machines (1 for replication) and you get 8TB (using 2TB
drives) of storage. When you want to replace a machine, just set up the new
one, fail the old one, and let GlusterFS rebuild the new one for you (AFR
will do the heavy lifting; see the sketch after this list). CPUs will get
faster, and hard drives will get faster and bigger in the future, so make it
easy to upgrade. A small number of BIG machines makes it a lot harder to do
upgrades as new hardware becomes available.
6) Machine failures (motherboard, power supply, etc.) will affect much
less of your storage network. Having a spare 1U machine around as a hot
spare doesn't cost much (maybe $1200). Having a spare 5U monster around
does (probably close to $6000).
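
To make the replacement step in point 5 concrete, a minimal sketch,
assuming a replicated volume mounted at /mnt/glusterfs and a replacement
server brought up with the same hostname and brick path as the failed one
(all names hypothetical):

# on the replacement server: recreate the empty brick and start the server
mkdir -p /data/brick1
glusterfsd -f /etc/glusterfs/glusterfsd.vol

# on any client: a recursive crawl makes AFR self-heal every file it
# touches, re-replicating the data onto the new brick
ls -lR /mnt/glusterfs > /dev/null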

IMHO 36 x 1U or 18 x 2U servers shouldn't cost any more (and maybe less)
than the big boxes you are looking to buy. They are commodity items.
If you go the 1U route you don't need anything but a machine with
memory and 4 hard drives (all server motherboards come with at least 4
SATA ports). By using 2TB drives, I think you would find that the cost
would actually be less. By NOT using hardware RAID you can also avoid
RAID-class hard drives, which cost about $100 more each than non-RAID
hard drives. That change alone could save you 6 x 24 = 144 drives x $100
= $14,400! JBOD just doesn't need RAID-class hard drives, because you
don't need the sophisticated firmware that RAID-class drives provide.
You will still want quality hard drives, but failures will have such a
low impact that it is much less of a problem.

By using more, smaller machines you also eliminate the need for redundant
power supplies (which would be a requirement in your large boxes, because
each power supply would be a single point of failure for a large
percentage of your storage system).

Hope the information helps.

Regards,
Larry Bates


------------------------------
Message: 6
Date: Thu, 17 Dec 2009 00:18:54 -0600
Subject: [Gluster-users] Recommended GlusterFS configuration for 6
node cluster
Content-Type: text/plain; charset=UTF-8
We're setting up 6 servers, each with 24 x 1.5TB drives; the systems
will run Debian testing and Gluster 3.x. The SATA RAID card offers
RAID5 and RAID6, and we're wondering what the optimum setup would be for
this configuration. Do we RAID5 the disks and have GlusterFS use
them that way, or do we keep them all 'raw' and have GlusterFS handle
the replication (though not 2x as we would have with the RAID
options)? Obviously there are a lot of ways to do this; just wondering
what GlusterFS devs and other experienced users would recommend.
Thanks
P
Tejas N. Bhise
2009-12-17 17:23:57 UTC
Thanks, Larry, for the comprehensive information.

Phil, I hope that answers a lot of your questions. Feel free to ask more; we have a great community here.

Regards,
Tejas.

phil cryer
2010-01-05 16:21:12 UTC
This is *very* helpful, thanks for taking the time Larry! Looking
forward to giving feedback once we have the cluster up.

P
Post by Tejas N. Bhise
Thanks, Larry, for the comprehensive information.
Phil, I hope that answers a lot of your questions. Feel free to ask more, we have a great community here.
Regards,
Tejas.
--
http://philcryer.com
Liam Slusser
2010-01-05 18:00:59 UTC
Larry & All,

I would much rather rebuild a bad drive with a RAID controller than
have to wait for Gluster to do it. With a large number of files, doing
an ls -aglR can take weeks. Also, you don't NEED enterprise drives with
a RAID controller; I use desktop 1.5TB Seagate drives, which are happy as a
clam on a 3ware SAS card under a SAS expander.

liam
Arvids Godjuks
2010-01-05 22:17:20 UTC
Consider this: a rebuild of a 1.5-2 TB HDD in a RAID5/6 array can easily
take up to a few days to complete, and during that time the storage on
that node will not perform well. A week ago I read a very good article
with research on this area; the only catch is that it's in Russian, but
it cites a few English sources too. Maybe Google Translate will help.
Here's the original link: http://habrahabr.ru/blogs/hardware/78311/
Here's the Google Translate version:
http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fhabrahabr.ru%2Fblogs%2Fhardware%2F78311%2F&sl=ru&tl=en
(looks quite neat, by the way)
Konstantin Sharlaimov
2010-01-06 01:10:46 UTC
The author is exaggerating. We recover a 6 TB RAID-5 array on
desktop-class hardware in less than 6 hours. Our RAID is controlled by
LSR (Linux Software RAID). Performance is not good while rebuilding a
single node, but the GlusterFS replicate/distribute translators help.
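
For anyone curious, watching or tuning such a rebuild under Linux
software RAID is straightforward; a small sketch (device name
hypothetical):

# watch rebuild progress
cat /proc/mdstat
mdadm --detail /dev/md0

# raise the kernel's minimum resync rate if the rebuild is being
# throttled (value is KB/s per device)
echo 50000 > /proc/sys/dev/raid/speed_limit_min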
Liam Slusser
2010-01-06 01:37:09 UTC
Arvids & Larry,

Interesting read, Arvids, and I completely agree. On our large RAID6
array it takes 1 week to rebuild the array after any ONE drive failure.
It's a scary time when doing a rebuild because of the decreased
performance from the array and the increased chance of a full RAID
failure if we were to lose another two drives. Makes for a very long week
of nail biting.

Larry brought up some great points. I, too, have been burned way too
many times by RAID5 and only use it if I absolutely have to; I
normally stick to RAID1/10 or RAID6/60. Even with my huge RAID6
rebuild time of a week, it's still faster to do that than to have Gluster
resync everything. The RAID rebuild does affect the performance of
the box, but so would a Gluster rebuild.

As for Larry's point #4, I duplicate the data across two boxes using
cluster/replicate on top of RAID6. So each box has a large RAID6
set, and I dup the data between the two. That way, if for whatever reason
I did lose a whole RAID array, I can still recover with Gluster.
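
For contrast with Larry's JBOD layout, a sketch of what I mean, with
hypothetical names: each box exports its whole RAID6 filesystem as one
brick, and cluster/replicate mirrors the two boxes:

volume box1
  type protocol/client
  option transport-type tcp
  option remote-host box1             # hypothetical hostname
  option remote-subvolume raid6-brick
end-volume

volume box2
  type protocol/client
  option transport-type tcp
  option remote-host box2
  option remote-subvolume raid6-brick
end-volume

volume mirror
  type cluster/replicate              # whole-box replication on top of RAID6
  subvolumes box1 box2
end-volume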

I've also been frowned on for using desktop drives in our servers,
but on the bright side I've had very few problems with them. Of
course, it did take buying a bunch of different RAID cards and drives
before finding a combination that played well together. We currently
have 240 Seagate 1.5TB desktop drives in our two Gluster clusters and
have only had to replace three in the last year: two that just died
and one that started to get SMART errors, so it was replaced. I haven't
had a problem getting Seagate to replace the drives; as they fail I ship
them off to Seagate and they send me a new one. I did figure we would
have to do support in house, so we bought lots of spare parts when we
ordered everything. It was still way cheaper to buy desktop drives
and Supermicro servers with lots of spare parts than shopping at Dell,
HP or Sun, by more than half.

Honestly my biggest peeve with Gluster is the rebuild process. Take the
OneFS file system in Isilon clusters: they are able to rebuild at the
block level, replicating only information which has changed. So even
with one node offline all day, a rebuild/resync operation is
very quick, and having 30 billion files or 10 huge ones makes no
difference to the resync speed. With Gluster, a huge directory
tree/number of files can take days if not weeks to finish. Of course,
given that Gluster runs on top of a normal filesystem such as
xfs/ext3/zfs, having access to block-level replication may be tricky.
I honestly would not be against the Gluster team modifying the
xfs/ext3/whatever filesystem so they could tailor it more to their
own needs, which of course would make it far less portable and much
more difficult to install and configure...

Whatever the solution is, I can tell you that the rebuild issues will
only get worse as drives continue to get larger and the number of
files/directories continues to grow. Sun's ZFS filesystem goes a long
way toward fixing some of these problems; I just wish they would port it
over to Linux.

liam
Arvids Godjuks
2010-01-06 02:28:25 UTC
Post by Liam Slusser
Arvids & Larry,
Interesting read, Arvids. ...
Well, I think the JBOD style makes losing a disk much less painful than
losing a whole RAID array. A single disk will be restored far faster
than a whole RAID array, especially if you can hot-swap your disks
(hardware permitting, of course).
Anyway, I do think that a correct combination of AFR and DHT will make
up for the disk loss. If only Gluster could relocate data as a
node/disk goes offline.
Post by Konstantin Sharlaimov
We recover 6 TB RAID-5 on desktop-class hardware in less than 6 hours.
RAID-6 isn't RAID-5. RAID-6 has more parity disks, which means
it can do parallel reads from more than one disk, so performance
doesn't degrade as much as with RAID-5. That's why I think you get
your restore done fast: your disks are able to receive data fast enough
to write at high speed. Recovering 1.5 TB per disk in about 6 hours works
out to roughly 70 MB/sec average write speed, and not many disks can keep
up such speeds all the time.
Harshavardhana
2010-01-06 03:48:48 UTC
Hi Liam,

* replies inline *
Post by Liam Slusser
Honestly my biggest peeve with Gluster is the rebuild process. [...]
With Gluster, a huge directory tree/number of files can take days if not
weeks to finish.
GlusterFS has done checksum-based self-heal since the 3.0 release; I
would believe your experiences are from 2.0, which has the issue of doing
a full-file self-heal, which takes a lot of time. I would suggest an
upgrade to the 3.0.1 release, which is due the first week of February,
for your cluster. With the new self-heal in the 3.x releases you should
see much shorter rebuild times. If it is possible to compare the 3.0.1
rebuild times with OneFS from Isilon, that should help us improve it too.
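
As a quick illustration (mount path hypothetical), after the upgrade you
can confirm the running version and trigger self-heal on demand; a
lookup/open from a client is enough to heal that file:

glusterfs --version

# heal a single file by looking it up from a client mount
stat /mnt/glusterfs/path/to/file

# or crawl the whole tree to heal everything
ls -lR /mnt/glusterfs > /dev/null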

Thanks
Post by Liam Slusser
Whatever the solution is, I can tell you that the rebuild issues will
only get worse as drives continue to get larger. [...] Sun's ZFS
filesystem goes a long way toward fixing some of these problems; I just
wish they would port it over to Linux.
I would suggest waiting for "btrfs".
Liam Slusser
2010-01-06 05:47:36 UTC
Yeah, I'm waiting for Gluster to come out with a 3.0.1 release before I
upgrade. I'll do my best to compare 3.0.1 with OneFS's
performance/recovery/etc. once I upgrade. I still have two Isilon
clusters in our lab that aren't in production anymore that I can play
around with.

And I've been waiting for btrfs for a while now; it can't come soon enough!

thanks,
liam
Post by Harshavardhana
GlusterFS has done checksum-based self-heal since the 3.0 release [...]
If it is possible to compare the 3.0.1 rebuild times with OneFS from
Isilon, that should help us improve it too.
I would suggest waiting for "btrfs".