Discussion:
[Gluster-users] JBOD / ZFS / Flash backed
Vincent Royer
2018-04-04 19:41:57 UTC
Permalink
Hi,

I'm trying to make the most of a limited budget. I need fast I/O for
operations under 4MB and high availability for VMs in an oVirt cluster.

I have 3 nodes running oVirt and want to rebuild them with hardware for
hyperconverged storage.

Should I use 2 960GB SSDs in RAID1 in each node, replica 3?

Or can I get away with 1 larger SSD per node, JBOD, replica 3?

Is flash-backed RAID required for JBOD, and should it be 1GB, 2GB, or 4GB
of flash?

The storage network will be 10GbE.

Enterprise SSDs and flash-backed RAID are very expensive, so I want to
ensure the investment provides the best value in terms of capacity,
performance, and availability.

Thanks,

Vincent
Alex Chekholko
2018-04-04 19:49:47 UTC
Permalink
Based on your message, it sounds like your total usable capacity
requirement is under 1TB. With a modern SSD, you'll get something like
40k theoretical IOPS at a 4k I/O size.
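As a rough sanity check on those ballpark figures (both numbers below are illustrative, not measurements), the throughput implied by 40k IOPS at a 4k I/O size works out as:

```shell
# Back-of-envelope: IOPS x I/O size = equivalent throughput
iops=40000        # ballpark random IOPS for a modern SATA SSD
io_size=4096      # 4 KiB per operation
mib_per_sec=$(( iops * io_size / 1024 / 1024 ))
echo "${mib_per_sec} MiB/s"   # prints "156 MiB/s"
```

At larger I/O sizes (like the 4MB operations mentioned above) the same drive is bandwidth-limited rather than IOPS-limited, which is why the requirement matters.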

You don't mention budget. What is your budget? You mention "4MB
operations"; where does that requirement come from?
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Vincent Royer
2018-04-04 19:58:46 UTC
Permalink
Thanks for your reply.

Yes, the VMs are very small and each provides only a single service. I would
prefer a total of 2TB, but 1TB to start is sufficient. Ideally I want a
scheme that is easy to expand by dropping an extra disk into each node; when
all slots are full, add another node.

Our current setup accesses a storage share via NFS; most read/write
operations under load are under 4MB, and there isn't any long sequential I/O.

We currently have 2 nodes; I am spec'ing the 3rd and adding the necessary
components to the existing ones. The budget is around $20k for the upgrade.
Vincent Royer
2018-04-09 15:49:17 UTC
Permalink
Post by Vincent Royer
Is flash-backed RAID required for JBOD, and should it be 1GB, 2GB, or 4GB
of flash?
Is anyone able to clarify this requirement for me?
Alex Chekholko
2018-04-09 17:34:55 UTC
Permalink
Your question is difficult to parse. Typically RAID and JBOD are mutually
exclusive. By "flash-backed", do you mean a battery backup unit (BBU) on
your RAID controller?
Vincent Royer
2018-04-09 18:00:08 UTC
Permalink
Yes. These flash-backed RAID cards use a supercapacitor to back up the
cache to flash; you have a choice of flash module sizes to include on the
card. The card supports RAID modes as well as JBOD.

I do not know whether Gluster can make use of the flash-backed cache when
the disks are presented by the RAID card as JBOD. The hardware vendor
asked, "Do you know if Gluster makes use of the flash cache in JBOD?"

If it does, I'm not certain how the size of this flash cache affects
operation.

*Vincent Royer*
*778-825-1057*


<http://www.epicenergy.ca/>
*SUSTAINABLE MOBILE ENERGY SOLUTIONS*
Alex Crow
2018-04-09 18:20:35 UTC
Permalink
I thought you wanted to use ZFS underneath Gluster? You need to state
your use case properly so we can help you.

Pretty much any RAID card I've come across does not use the on-board
cache (which is always RAM, backed up with a BBU or flash chip(s) plus
supercaps) when set to JBOD mode. You could use a RAID controller with
GlusterFS and XFS or EXT4 underneath, but why not just use the redundancy
built into GlusterFS? You could still use software RAID if you're really
concerned about your data.

If you're using ZFS, you don't even want a RAID firmware/card set to JBOD
mode; you need an HBA, or a dual-purpose RAID/HBA card running the HBA
firmware (IT mode in LSI/Avago/Broadcom cards, e.g. the IBM M1015 and M1115).

Even if you use XFS or EXT4 underneath Gluster, I'd still look at leaving
out a RAID-capable controller: if the card fails and you can't get the
same model, you can't just plonk the drives into any other box with
SATA/SAS ports and carry on as before.

In either case, don't use desktop drives, as they often lie about whether
they flush their own RAM cache. Use nearline enterprise SATA or SAS drives.

Part of the point of GlusterFS and ZFS is that they're "software defined":
you use fast but dumb drive controllers so you never have to worry about
hardware compatibility and availability again; it's all in the OS/FS stack,
and the hardware conforms to open standards.

E.g.:

Client apps > GlusterFS > ZFS | > HBA > JBOD
Client apps > GlusterFS > XFS | > HBA > JBOD

All the caching, resilience, and failover are handled above the place
where I've put the pipe character. This means your HBA and enclosures can
go up in smoke; as long as you still have the drives, you'll have your data.
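A minimal sketch of provisioning that stack; the pool name, dataset name, hostnames (node1..node3), volume name, and device paths are all placeholders, and the exact flags should be checked against your ZFS and Gluster versions:

```shell
# On each node: a ZFS pool on disks attached via a plain HBA.
# ashift=12 aligns writes to 4K sectors; names/devices are examples.
zpool create -o ashift=12 brickpool mirror /dev/sdb /dev/sdc
zfs create brickpool/brick1
mkdir -p /brickpool/brick1/data   # brick directory inside the dataset

# From any one node: a replica-3 volume, one brick per host.
gluster volume create gv0 replica 3 \
    node1:/brickpool/brick1/data \
    node2:/brickpool/brick1/data \
    node3:/brickpool/brick1/data
gluster volume start gv0
```

Everything above the HBA lives in software, which is the point being made: the same commands work on any box that can see the drives.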

What are your plans WRT your underlying brick FS? ZFS or other?






--
This message is intended only for the addressee and may contain
confidential information. Unless you are that person, you may not
disclose its contents or use it in any way and are requested to delete
the message along with any attachments and notify us immediately.
This email is not intended to, nor should it be taken to, constitute advice.
The information provided is correct to our knowledge & belief and must not
be used as a substitute for obtaining tax, regulatory, investment, legal or
any other appropriate advice.

"Transact" is operated by Integrated Financial Arrangements Ltd.
29 Clement's Lane, London EC4N 7AE. Tel: (020) 7608 4900 Fax: (020) 7608 5300.
(Registered office: as above; Registered in England and Wales under
number: 3727592). Authorised and regulated by the Financial Conduct
Authority (entered on the Financial Services Register; no. 190856).
Alex Crow
2018-04-09 17:49:44 UTC
Permalink
Post by Vincent Royer
Is flash-backed RAID required for JBOD, and should it be 1GB, 2GB,
or 4GB of flash?
RAID and JBOD are completely different things. JBODs are just that,
bunches of disks, and they don't have any cache above them in hardware.
If you're going to use ZFS under Gluster, look at the ZFS docs first. The
short answer is no. If Gluster passes sync writes down to the lower-level
FS as sync, and you decide to use a ZFS SLOG device (usually an SSD), it
should have power-fail protection capacitors.
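For reference, a SLOG can be added to an existing pool at any time; the pool and device names below are placeholders, and the device should be one of those power-loss-protected SSDs:

```shell
# Attach a power-loss-protected SSD as a separate ZFS intent log (SLOG)
zpool add brickpool log /dev/nvme0n1

# Or mirror the SLOG so a failing log device can't lose in-flight syncs:
# zpool add brickpool log mirror /dev/nvme0n1 /dev/nvme1n1

zpool status brickpool   # the device appears under a "logs" section
```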

Do *not* use a RAID controller for ZFS; use a decent HBA instead, so ZFS
can access the disks directly. Using a RAID controller is not only
wasteful, it's setting yourself up for a world of pain when a ZFS VDEV
member device fails. Don't do it.


Vincent Royer
2018-04-09 18:02:27 UTC
Permalink
Thanks,

I suppose what I'm trying to gain is some clarity on which choice is best
for a given application. How do I know whether it's better for me to use a
RAID card or not, to include flash cache on it or not, or to use ZFS or
not, when combined with a small number of SSDs in replica 3?
Alex Crow
2018-04-09 18:25:15 UTC
Permalink
How few is a "small number", and most importantly, how many per server?
Replica 3 is a start, as it already tells us you can lose one entire
server and carry on as normal. If you lose two, your GlusterFS is down.

What is your resilience goal? You should really be starting with
requirements, not speccing out and buying servers and drives and then
trying to force them to fit your expectations.

Vincent Royer
2018-04-09 21:15:27 UTC
Permalink
Thanks,

The 3 servers are new Lenovo units with redundant PSUs backed by two huge
UPS units (one for each bank of power supplies). I think the chance of
losing two nodes is incredibly slim, and in that case a disaster recovery
from offsite backups would be reasonable.

My requirements are about 2TB, highly available (so that I can reboot one
of the 3 servers without taking down services).

Beyond that my focus is high performance for small I/O.

So I could do a single 2TB SSD per server, or two, or many more if that is
"what is required". But I don't want to waste money...
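For budgeting purposes, replica 3 keeps the capacity arithmetic simple: usable space equals the size of one brick, not the sum across nodes. With the sizes being discussed (illustrative numbers only):

```shell
# Replica 3 keeps a full copy on every node, so usable = one brick
ssd_tb=2                        # one 2 TB SSD per node
nodes=3
raw_tb=$(( ssd_tb * nodes ))    # total raw flash purchased
usable_tb=$ssd_tb
echo "raw: ${raw_tb} TB, usable: ${usable_tb} TB"   # prints "raw: 6 TB, usable: 2 TB"
```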

I like the idea of forgoing the RAID cards, as they are quite expensive,
especially the capacitor-backed ones. The onboard controller can handle
JBOD just fine, if Gluster is OK with it!

Alex Crow
2018-04-12 19:25:15 UTC
Permalink
Post by Vincent Royer
Thanks,
The 3 servers are new Lenovo units with redundant PS backed by two
huge UPS units (one for each bank of power supplies).  I think the
chances of losing two nodes is incredibly slim, and in that case a
Disaster Recovery from offsite backups would be reasonable.
My requirements are about 2TB, highly available (so that I can reboot
one of the 3 servers without taking down services).
Beyond that my focus is high performance for small I/O.
This can be a difficult case for GlusterFS if you mean "small files", as
the metadata lookups are relatively costly (there is no separate MDS with
an in-memory or memory-cached database). It's ideally placed for large
files, and small I/O within those files should be OK. Just speaking from
experience: it should be fine for VMs with such loads, especially if you
shard.
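Sharding is a per-volume Gluster option; a sketch of turning it on (the volume name gv0 is a placeholder, the right block size depends on your VM image sizes, and option names should be verified against your Gluster version):

```shell
# Split large files (e.g. VM images) into fixed-size shards so heals
# and I/O spread across bricks instead of hitting one huge file
gluster volume set gv0 features.shard on
gluster volume set gv0 features.shard-block-size 64MB

gluster volume get gv0 features.shard   # confirm the setting took effect
```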
Post by Vincent Royer
So I could do a single 2TB SSD per server, or two, or many more if
that is "what is required".  But I don't want to waste money...
Resilience is never a waste. Skimping may well prove to be a waste of
*your time* when you get woken up at 3am and have to fix a downed
system. Your call entirely. I'm too old for that kind of thing, so I
tend to push for both per-server and per-cluster redundancy. It usually
gets approved after something "unexpected" happens the first time.

Gluster and ZFS will be fine with onboard controllers. If you have enough
ports, you'll be just fine. If you need more, buy HBAs to stick in your
PCIe slots; M1015s and M1115s on eBay perform very well and are still
dirt cheap.

So are you using ZFS to get compression and checksumming down to the disk
platter level? ZFS will give some performance gains with compressible data,
plus corruption protection, but don't bother with dedup: I've tried it on
3 distributed filesystems and it bought less than 3% capacity and slammed
performance. If you don't need either feature, just stick with XFS on a
single disk or software-RAIDed mirrors per brick. My personal opinion would
be to do a ZFS mirror of two SSDs per server per brick, i.e. in your
initial case, 2x 2TB SSDs per box in a ZFS mirror. You can add more mirror
sets later to add additional bricks.
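The compression advice translates to a couple of dataset properties; pool and dataset names below are placeholders. lz4 is generally cheap enough to leave on, and dedup is off by default, which is where it should stay:

```shell
# Cheap inline compression on the brick dataset; leave dedup alone
zfs set compression=lz4 brickpool/brick1

zfs get compression,compressratio brickpool/brick1  # see what it achieves
zfs get dedup brickpool/brick1                      # "off" is the default
```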
Post by Vincent Royer
I like the idea of forgoing the RAID cards as they are quite
expensive, especially the capacitor backed ones.  The onboard
controller can handle JBOD just fine, if Gluster is OK with it!
As I also said, if said expensive card dies and you don't have another
one in stock, you will effectively have lost everything on that server
until you can source a new one (or even /if/ you can).

Use the power of the software to get where you need to be, the tools are
there...

Alex
