Discussion:
[Gluster-users] Horrible Gluster Performance
Philip
2012-04-13 09:25:58 UTC
I have a small GlusterFS cluster providing a replicated volume. Each server
has 2 SAS disks for the OS and logs and 22 SATA disks for the actual data,
striped together as a RAID10 using a MegaRAID SAS 9280-4i4e with this
configuration: http://pastebin.com/2xj4401J

Connected to this cluster are a few other servers with the native client
running nginx to serve files stored on it in the order of 3-10MB.

Right now a storage server has an outgoing bandwidth of 300 Mbit/s and the
busy rate of the RAID array is at 30-40%. There are also strange
side effects: sometimes the I/O latency skyrockets and there is no access
possible on the RAID for >10 seconds. This happens at 300 Mbit/s as well as at
1000 Mbit/s of outgoing bandwidth. The file system is XFS and it has
been tuned to match the RAID stripe size.
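For reference, a minimal sketch of what that tuning typically looks like; the
su/sw values and the brick path here are hypothetical and would have to come
from the controller's actual stripe size and the 11 data spindles of the RAID10:

    # illustrative only -- mkfs destroys the device's contents
    # su = controller stripe size, sw = number of data spindles (11 mirrored pairs)
    mkfs.xfs -d su=256k,sw=11 /dev/sdb
    # mount options commonly used for large XFS bricks
    mount -o inode64,noatime /dev/sdb /export/brick1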

I've tested all sorts of Gluster settings, but none seemed to have any effect,
so I've reset the volume configuration and it is now using the defaults.

Does anyone have an idea what could be the reason for such a bad
performance? 22 Disks in a RAID10 should deliver *way* more throughput.
Brian Candler
2012-04-13 09:42:39 UTC
Post by Philip
Sometimes the io-latency skyrockets and there is no
access possible on the raid for >10 seconds.
Have you checked
http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/
?

If you have a large amount of RAM and a lot of writes, maybe you're
accumulating large amounts of dirty data which is then being synchronously
flushed.
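A quick way to see whether that is happening is to watch the dirty-page
counters and the current writeback thresholds while the machine is under load,
along these lines:

    # how much dirty data is waiting to be written back
    grep -E 'Dirty|Writeback' /proc/meminfo
    # current thresholds (percent of RAM)
    sysctl vm.dirty_background_ratio vm.dirty_ratio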
Philip
2012-04-13 10:06:23 UTC
Post by Brian Candler
Post by Philip
Sometimes the io-latency skyrockets and there is no
access possible on the raid for >10 seconds.
Have you checked
http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/
?
If you have a large amount of RAM and a lot of writes, maybe you're
accumulating large amounts of dirty data which is then being synchronously
flushed.
I do indeed have lots of RAM, but I disabled all write access for a few
hours to check whether this issue is write-related. The latency spikes
also happened when there were no writes at all.

I haven't changed these values. They were at
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20

I've changed it to
vm.dirty_background_ratio = 2
vm.dirty_ratio = 10
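For completeness, such a change can be applied at runtime and made persistent
roughly like this (a sketch, not a recommendation of these particular values):

    # apply immediately
    sysctl -w vm.dirty_background_ratio=2
    sysctl -w vm.dirty_ratio=10
    # keep across reboots
    echo 'vm.dirty_background_ratio = 2' >> /etc/sysctl.conf
    echo 'vm.dirty_ratio = 10' >> /etc/sysctl.conf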
Jerker Nyberg
2012-04-13 09:58:36 UTC
Post by Philip
Does anyone have an idea what could be the reason for such a bad
performance? 22 Disks in a RAID10 should deliver *way* more throughput.
You may already have done so, but you can check the I/O utilization of the
devices with iostat's "-x" flag, for example "iostat -x 2" for a two-second
interval. Check the percentage utilization in the "%util" column on the
right. If it is closer to 100 than to 0, the disk subsystem might actually
be busy.
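For example (iostat comes with the sysstat package; limiting it to the RAID
device is optional):

    # extended stats, refreshed every 2 seconds
    iostat -x 2
    # or only the data array
    iostat -x 2 sdb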

--jerker
Philip
2012-04-13 10:10:12 UTC
Post by Jerker Nyberg
Post by Philip
Does anyone have an idea what could be the reason for such a bad
performance? 22 Disks in a RAID10 should deliver *way* more throughput.
You may already have done so, but you can check the I/O utilization of the
devices with iostat's "-x" flag, for example "iostat -x 2" for a two-second
interval. Check the percentage utilization in the "%util" column on the
right. If it is closer to 100 than to 0, the disk subsystem might actually
be busy.
--jerker
Here is the output (Outgoing bandwidth is currently at 380 Mbps):


Device:    rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda          0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00   0.00   0.00
sdb          1.00    0.00  129.50    0.00  42624.00      0.00    329.14      1.53   10.97   3.63  47.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.53    2.82    0.00   96.60

Device:    rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda          0.00    0.50    0.00    0.50      0.00      8.00     16.00      0.00    0.00   0.00   0.00
sdb          1.00    0.00  184.00   47.50  64084.00  29412.00    403.87      1.81    8.29   2.16  50.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.59    2.50    0.00   96.85

Device:    rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda          0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00   0.00   0.00
sdb          1.00    0.00  156.50    0.00  54944.00      0.00    351.08      1.30    8.28   2.62  41.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.21    0.00    0.48    1.61    0.00   97.70
