Discussion:
[Gluster-users] Gluster very poor performance when copying small files (1x (2+1) = 3, SSD)
Sam McLeod
2018-03-18 22:13:36 UTC
Howdy all,

We're experiencing terrible small-file performance when copying or moving files on Gluster clients.

In the example below, Gluster takes roughly 6 minutes to copy 128 MB / 21,000 files sideways on a client; doing the same thing on NFS (which I know is a totally different solution, etc.) takes approximately 10-15 seconds(!).

Any advice for tuning the volume or XFS settings would be greatly appreciated.

Hopefully I've included enough relevant information below.


## Gluster Client

***@gluster-client:/mnt/gluster_perf_test/ # du -sh .
127M .
***@gluster-client:/mnt/gluster_perf_test/ # find . -type f | wc -l
21791
***@gluster-client:/mnt/gluster_perf_test/ # du 9584toto9584.txt
4 9584toto9584.txt


***@gluster-client:/mnt/gluster_perf_test/ # time cp -a private private_perf_test

real 5m51.862s
user 0m0.862s
sys 0m8.334s

***@gluster-client:/mnt/gluster_perf_test/ # time rm -rf private_perf_test/

real 0m49.702s
user 0m0.087s
sys 0m0.958s
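For anyone wanting to reproduce this kind of workload, a minimal sketch is below; the file count and size are placeholders, not our exact dataset. Run it inside whichever mount you want to test:

```shell
# Create COUNT tiny files, then time a sideways copy of the tree,
# mirroring the cp -a test above.
COUNT=1000
SRC=$(mktemp -d)
i=1
while [ "$i" -le "$COUNT" ]; do
  head -c 4096 /dev/zero > "$SRC/file_$i.txt"
  i=$((i + 1))
done
time cp -a "$SRC" "${SRC}_copy"
find "${SRC}_copy" -type f | wc -l
```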


## Hosts

- 16x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz per Gluster host / client
- Storage: iSCSI provisioned (via 10Gbit DAC/Fibre), SSD disk, 50K R/RW 4k IOP/s, 400MB/s per Gluster host
- Volumes are replicated across two hosts and one arbiter-only host
- Networking is 10Gbit DAC/Fibre between Gluster hosts and clients
- 18GB DDR4 ECC memory

## Volume Info

***@gluster-host-01:~ # gluster pool list
UUID                                  Hostname              State
ad02970b-e2aa-4ca8-998c-bd10d5970faa  gluster-host-02.fqdn  Connected
ea116a94-c19e-48db-b108-0be3ae622e2e  gluster-host-03.fqdn  Connected
2e855c25-e7ac-4ff6-be85-e8bcc6f45ee4  localhost             Connected

***@gluster-host-01:~ # gluster volume info uat_storage

Volume Name: uat_storage
Type: Replicate
Volume ID: 7918f1c5-5031-47b8-b054-56f6f0c569a2
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: gluster-host-01.fqdn:/mnt/gluster-storage/uat_storage
Brick2: gluster-host-02.fqdn:/mnt/gluster-storage/uat_storage
Brick3: gluster-host-03.fqdn:/mnt/gluster-storage/uat_storage (arbiter)
Options Reconfigured:
performance.rda-cache-limit: 256MB
network.inode-lru-limit: 50000
server.outstanding-rpc-limit: 256
performance.client-io-threads: true
nfs.disable: on
transport.address-family: inet
client.event-threads: 8
cluster.eager-lock: true
cluster.favorite-child-policy: size
cluster.lookup-optimize: true
cluster.readdir-optimize: true
cluster.use-compound-fops: true
diagnostics.brick-log-level: ERROR
diagnostics.client-log-level: ERROR
features.cache-invalidation-timeout: 600
features.cache-invalidation: true
network.ping-timeout: 15
performance.cache-invalidation: true
performance.cache-max-file-size: 6MB
performance.cache-refresh-timeout: 60
performance.cache-size: 1024MB
performance.io-thread-count: 16
performance.md-cache-timeout: 600
performance.stat-prefetch: true
performance.write-behind-window-size: 256MB
server.event-threads: 8
transport.listen-backlog: 2048

***@gluster-host-01:~ # xfs_info /dev/mapper/gluster-storage-unlocked
meta-data=/dev/mapper/gluster-storage-unlocked isize=512    agcount=4, agsize=196607360 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=786429440, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=383998, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


--
Sam McLeod (protoporpoise on IRC)
https://smcleod.net
https://twitter.com/s_mcleod

Words are my own opinions and do not necessarily represent those of my employer or partners.
TomK
2018-03-18 23:37:25 UTC
On 3/18/2018 6:13 PM, Sam McLeod wrote:
Even your NFS transfers are only about 12.5 MB per second or less.

1) Did you use fdisk and LVM under that XFS filesystem?

2) Did you benchmark the XFS with something like bonnie++? (There's
probably newer benchmark suites now.)

3) Did you benchmark your Network transfer speeds? Perhaps your NIC
negotiated a lower speed.

4) I've done XFS tuning for another purpose and got good results. If it
helps, I can send you the doc.
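For what it's worth, the checks in 2) and 3) can be done with standard tools; something along these lines (hostnames are placeholders, and the exact invocations are only a sketch):

```shell
# Network: raw TCP throughput between a client and a gluster host.
iperf3 -s                              # run on gluster-host-01
iperf3 -c gluster-host-01.fqdn -t 30   # run from the client

# Disk: benchmark the brick filesystem directly, bypassing gluster.
bonnie++ -d /mnt/gluster-storage -u root
```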

Cheers,
Tom
Post by Sam McLeod
[quoted message trimmed]
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.
Sam McLeod
2018-03-19 00:20:14 UTC
Hi Tom,

Thanks for your reply.

1. Yes, XFS is on a LUKS LV (see below).
2. Yes, I prefer FIO, but each Gluster host gets between 50-100K 4K random IOPS, both write and read, to disk.
3. Yes, we actually use 2x 10Gbit DACs in LACP, but we get full 10Gbit speeds (and very low latency thanks to the DACs).
4. I'd love to see that, it'd be much appreciated thanks.
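For reference, a fio invocation along the lines of that 4K random test would look something like this; the parameters below are illustrative, not necessarily the exact job we ran:

```shell
# Illustrative 4k random read/write benchmark against the brick
# filesystem; --direct=1 bypasses the page cache.
fio --name=rand4k --filename=/mnt/gluster-storage/fio.test \
    --rw=randrw --rwmixread=50 --bs=4k --size=1G \
    --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=60 --time_based --group_reporting
```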



# lsblk
NAME                           MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
xvdc                           202:32   0  1.5T  0 disk
└─xvdc1                        202:33   0  1.5T  0 part
  └─gluster-storage            253:1    0    3T  0 lvm
    └─gluster-storage-unlocked 253:3    0    3T  0 crypt /mnt/gluster-storage
xvda                           202:0    0   18G  0 disk
├─xvda2                        202:2    0 17.5G  0 part
│ ├─centos-var                 253:2    0  9.5G  0 lvm   /var
│ └─centos-root                253:0    0    8G  0 lvm   /
└─xvda1                        202:1    0  500M  0 part  /boot
sr0                             11:0    1 1024M  0 rom
xvdb                           202:16   0  1.5T  0 disk
└─xvdb1                        202:17   0  1.5T  0 part
  └─gluster-storage            253:1    0    3T  0 lvm
    └─gluster-storage-unlocked 253:3    0    3T  0 crypt /mnt/gluster-storage

--
Sam McLeod
Please respond via email when possible.
https://smcleod.net
https://twitter.com/s_mcleod
Post by TomK
[quoted message trimmed]
Rik Theys
2018-03-19 09:37:57 UTC
Hi,

I've done some similar tests and experience similar performance issues
(see my 'gluster for home directories?' thread on the list).

If I read your mail correctly, you are comparing an NFS mount of the
brick disk against a gluster mount (using the fuse client)?

Which options do you have set on the NFS export (sync or async)?

From my tests, I concluded that the issue was not bandwidth but latency:
Gluster will only return an IO operation once all bricks have confirmed
that the data is on disk. If you are using a fuse mount, you could try
mounting with the 'direct-io-mode=disable' client option and compare
(I have no experience with this).
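Concretely, that comparison would look something like the following; the volume and mount-point names are taken from Sam's volume info, and I haven't measured the effect myself:

```shell
# Remount the fuse client with direct-io-mode disabled, then repeat
# the small-file copy test on the remounted volume.
umount /mnt/gluster_perf_test
mount -t glusterfs -o direct-io-mode=disable \
    gluster-host-01.fqdn:/uat_storage /mnt/gluster_perf_test
```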

In our tests, I've used NFS-ganesha to serve the gluster volume over
NFS. This makes things even worse as NFS-ganesha has no "async" mode,
which makes performance terrible.

If you find a magic knob to make glusterfs fast on small-file workloads,
do let me know!

Regards,

Rik
Post by Sam McLeod
[quoted message trimmed]
--
Rik Theys
System Engineer
KU Leuven - Dept. Elektrotechniek (ESAT)
Kasteelpark Arenberg 10 bus 2440 - B-3001 Leuven-Heverlee
+32(0)16/32.11.07
----------------------------------------------------------------
<<Any errors in spelling, tact or fact are transmission errors>>
Ondrej Valousek
2018-03-19 09:42:46 UTC
Hi,
As I posted in my previous emails, glusterfs can never match NFS (especially async NFS) on small-file/latency performance; that's inherent to the design.
Nothing you can do about it.
Ondrej

-----Original Message-----
From: Rik Theys
Sent: Monday, March 19, 2018 10:38 AM
Subject: Re: [Gluster-users] Gluster very poor performance when copying small files (1x (2+1) = 3, SSD)

[quoted message and corporate e-mail disclaimer trimmed]
TomK
2018-03-19 14:42:18 UTC
On 3/19/2018 5:42 AM, Ondrej Valousek wrote:
Removing NFS or NFS-Ganesha from the equation, I'm not very impressed with
my own setup either. For the writes it's doing, that's a lot of CPU usage
in top. It seems bottlenecked on a single execution core somewhere, trying
to facilitate reads/writes to the other bricks.

Writes to the gluster FS from within one of the gluster participating
bricks:

[***@nfs01 n]# dd if=/dev/zero of=./some-file.bin

393505+0 records in
393505+0 records out
201474560 bytes (201 MB) copied, 50.034 s, 4.0 MB/s

[***@nfs01 n]#

Top results (10-second average) won't go over 32%:

top - 00:49:38 up 21:39, 2 users, load average: 0.42, 0.24, 0.19
Tasks: 164 total, 1 running, 163 sleeping, 0 stopped, 0 zombie
%Cpu0 : 29.3 us, 24.7 sy, 0.0 ni, 45.1 id, 0.0 wa, 0.0 hi, 0.8 si, 0.0 st
%Cpu1 : 27.2 us, 24.1 sy, 0.0 ni, 47.2 id, 0.0 wa, 0.0 hi, 1.5 si, 0.0 st
%Cpu2 : 20.2 us, 13.5 sy, 0.0 ni, 64.1 id, 0.0 wa, 0.0 hi, 2.3 si, 0.0 st
%Cpu3 : 30.0 us, 16.2 sy, 0.0 ni, 47.5 id, 0.0 wa, 0.0 hi, 6.3 si, 0.0 st
KiB Mem : 3881708 total, 3207488 free, 346680 used, 327540 buff/cache
KiB Swap: 4063228 total, 4062828 free, 400 used. 3232208 avail Mem

 PID USER PR NI    VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
1319 root 20  0  819036 12928 4036 S 32.3  0.3 1:19.64 glusterfs
1310 root 20  0 1232428 25636 4364 S 12.1  0.7 0:41.25 glusterfsd


Next, the same write but directly to the brick via XFS, which of course
is faster:


top - 09:45:09 up 1 day, 6:34, 3 users, load average: 0.61, 1.01, 1.04
Tasks: 171 total, 2 running, 169 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.6 us, 2.1 sy, 0.0 ni, 82.6 id, 14.5 wa, 0.0 hi, 0.2 si, 0.0 st
%Cpu1 : 16.7 us, 83.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.4 us, 0.9 sy, 0.0 ni, 94.2 id, 4.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 1.1 us, 0.6 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3881708 total, 501120 free, 230704 used, 3149884 buff/cache
KiB Swap: 4063228 total, 3876896 free, 186332 used. 3343960 avail Mem

  PID USER PR NI    VIRT   RES SHR S %CPU %MEM    TIME+ COMMAND
14691 root 20  0  107948   608 512 R 25.0  0.0  0:34.29 dd
 1334 root 20  0 2694264 61076 2228 S  2.7  1.6 283:55.96 ganesha.nfsd


The result of a dd command directly against the brick FS itself is of
course much better:


[***@nfs01 gv01]# dd if=/dev/zero of=./some-file.bin
5771692+0 records in
5771692+0 records out
2955106304 bytes (3.0 GB) copied, 35.3425 s, 83.6 MB/s

[***@nfs01 gv01]# pwd
/bricks/0/gv01
[***@nfs01 gv01]#

Tried a few tweak options with no effect:

[***@nfs01 glusterfs]# gluster volume info

Volume Name: gv01
Type: Replicate
Volume ID: e5ccc75e-5192-45ac-b410-a34ebd777666
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: nfs01:/bricks/0/gv01
Brick2: nfs02:/bricks/0/gv01
Options Reconfigured:
cluster.server-quorum-type: server
cluster.quorum-type: auto
server.event-threads: 8
client.event-threads: 8
performance.readdir-ahead: on
performance.write-behind-window-size: 8MB
performance.io-thread-count: 16
performance.cache-size: 1GB
nfs.trusted-sync: on
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
[***@nfs01 glusterfs]#

That's despite the fact that I can confirm 90+ MB/s on my 1GbE network.
Thoughts?
--
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.
Post by Ondrej Valousek
[quoted message trimmed]
Rik Theys
2018-03-19 14:52:42 UTC
Permalink
Hi,
Post by TomK
Removing NFS or NFS Ganesha from the equation, not very impressed on my
own setup either.  For the writes it's doing, that's a lot of CPU usage
in top. Seems bottlenecked via a single execution core somewhere trying
to facilitate reads/writes to the other bricks.
Writes to the gluster FS from within one of the gluster participating nodes:
393505+0 records in
393505+0 records out
201474560 bytes (201 MB) copied, 50.034 s, 4.0 MB/s
That's not really a fair comparison as you don't specify a blocksize.
What does

dd if=/dev/zero of=./some-file.bin bs=1M count=1000 oflag=direct

give?


Rik
--
Rik Theys
System Engineer
KU Leuven - Dept. Elektrotechniek (ESAT)
Kasteelpark Arenberg 10 bus 2440 - B-3001 Leuven-Heverlee
+32(0)16/32.11.07
----------------------------------------------------------------
<<Any errors in spelling, tact or fact are transmission errors>>
TomK
2018-03-19 20:25:57 UTC
Permalink
Post by Rik Theys
Hi,
Post by TomK
Removing NFS or NFS Ganesha from the equation, not very impressed on my
own setup either.  For the writes it's doing, that's a lot of CPU usage
in top. Seems bottlenecked via a single execution core somewhere trying
to facilitate reads/writes to the other bricks.
Writes to the gluster FS from within one of the gluster participating nodes:
393505+0 records in
393505+0 records out
201474560 bytes (201 MB) copied, 50.034 s, 4.0 MB/s
That's not really a fair comparison as you don't specify a blocksize.
What does
dd if=/dev/zero of=./some-file.bin bs=1M count=1000 oflag=direct
give?
Rik
Correct. Higher block sizes gave me better numbers earlier. I'm curious
about improving the small-file performance though, preferably via
gluster tunables, if possible.

Though I guess it could be said that compressing a set of large files
and transferring them over that way is one solution. However, I needed
the small block size on dd to quickly simulate a lot of small requests
in a somewhat ok-ish way.
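
A more direct way to approximate that workload than a tiny dd blocksize
might be to create many small files in a loop. A hedged sketch (DIR is a
placeholder; point it at a Gluster mount and wrap the loop in `time` to
measure):

```shell
# Sketch: generate many 4 KB files to approximate a small-file workload.
# DIR is a placeholder; defaults to a local directory for illustration.
DIR=${DIR:-./smallfile_test}
mkdir -p "$DIR"
i=0
while [ $i -lt 200 ]; do
    # one 4 KB file per iteration
    dd if=/dev/zero of="$DIR/f_$i" bs=4k count=1 2>/dev/null
    i=$((i + 1))
done
ls "$DIR" | wc -l
```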

Here's the numbers from the VM:

[ Via Gluster ]
[***@nfs01 n]# dd if=/dev/zero of=./some-file.bin bs=1M count=10000
oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 96.3228 s, 109 MB/s
[***@nfs01 n]# rm some-file.bin
rm: remove regular file 'some-file.bin'? y

[ Via XFS ]
[***@nfs01 n]# cd /bricks/0/gv01/
[***@nfs01 gv01]# dd if=/dev/zero of=./some-file.bin bs=1M count=10000
oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 44.79 s, 234 MB/s
[***@nfs01 gv01]#



top - 12:49:48 up 1 day, 9:39, 2 users, load average: 0.66, 1.15, 1.82
Tasks: 165 total, 1 running, 164 sleeping, 0 stopped, 0 zombie
%Cpu0  : 10.3 us,  9.6 sy,  0.0 ni, 28.0 id, 50.4 wa,  0.0 hi,  1.8 si,  0.0 st
%Cpu1  : 13.8 us, 13.8 sy,  0.0 ni, 38.6 id, 30.0 wa,  0.0 hi,  3.8 si,  0.0 st
%Cpu2  :  8.7 us,  6.9 sy,  0.0 ni, 48.7 id, 34.9 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu3  : 10.6 us,  7.8 sy,  0.0 ni, 57.1 id, 24.1 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 3881708 total, 3543280 free, 224008 used, 114420 buff/cache
KiB Swap: 4063228 total, 3836612 free, 226616 used. 3457708 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14115 root 20 0 2504832 27640 2612 S 43.5 0.7 432:10.35 glusterfsd
1319 root 20 0 1269620 23780 2636 S 38.9 0.6 752:44.78 glusterfs
1334 root 20 0 2694264 56988 1672 S 16.3 1.5 311:20.90 ganesha.nfsd
27458 root 20 0 108984 1404 540 D 3.0 0.0 0:00.24 dd
14127 root 20 0 1164720 4860 1960 S 0.7 0.1 1:47.59 glusterfs
750 root 20 0 389864 5528 3988 S 0.3 0.1 0:08.77 sssd_be
--
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.
Raghavendra Gowdappa
2018-03-20 02:55:53 UTC
Permalink
Post by TomK
Post by Rik Theys
Hi,
Post by TomK
Removing NFS or NFS Ganesha from the equation, not very impressed on my
own setup either. For the writes it's doing, that's a lot of CPU usage
in top. Seems bottlenecked via a single execution core somewhere trying
to facilitate reads/writes to the other bricks.
Writes to the gluster FS from within one of the gluster participating nodes:
393505+0 records in
393505+0 records out
201474560 bytes (201 MB) copied, 50.034 s, 4.0 MB/s
That's not really a fair comparison as you don't specify a blocksize.
What does
dd if=/dev/zero of=./some-file.bin bs=1M count=1000 oflag=direct
give?
Rik
Correct. Higher block sizes gave me better numbers earlier. I'm curious
about improving the small-file performance though, preferably via
gluster tunables, if possible.
Though I guess it could be said that compressing a set of large files and
transferring them over that way is one solution. However, I needed the
small block size on dd to quickly simulate a lot of small requests in a
somewhat ok-ish way.
Aggregation of a large number of small writes into large writes by
write-behind has been merged on master:
https://github.com/gluster/glusterfs/issues/364

I'd like to know whether it helps for this use case. Note that it's not
part of any release yet, so you'd have to build and install from the repo.

Another suggestion is to run tests with turning off option
performance.write-behind-trickling-writes.

# gluster volume set <volname> performance.write-behind-trickling-writes off

A word of caution though: if your files are too small, these suggestions
may not have much impact.
Post by TomK
[ Via Gluster ]
oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 96.3228 s, 109 MB/s
rm: remove regular file 'some-file.bin'? y
[ Via XFS ]
oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 44.79 s, 234 MB/s
top - 12:49:48 up 1 day, 9:39, 2 users, load average: 0.66, 1.15, 1.82
Tasks: 165 total, 1 running, 164 sleeping, 0 stopped, 0 zombie
%Cpu0 : 10.3 us, 9.6 sy, 0.0 ni, 28.0 id, 50.4 wa, 0.0 hi, 1.8 si,
0.0 st
%Cpu1 : 13.8 us, 13.8 sy, 0.0 ni, 38.6 id, 30.0 wa, 0.0 hi, 3.8 si,
0.0 st
%Cpu2 : 8.7 us, 6.9 sy, 0.0 ni, 48.7 id, 34.9 wa, 0.0 hi, 0.7 si,
0.0 st
%Cpu3 : 10.6 us, 7.8 sy, 0.0 ni, 57.1 id, 24.1 wa, 0.0 hi, 0.4 si,
0.0 st
KiB Mem : 3881708 total, 3543280 free, 224008 used, 114420 buff/cache
KiB Swap: 4063228 total, 3836612 free, 226616 used. 3457708 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14115 root 20 0 2504832 27640 2612 S 43.5 0.7 432:10.35
glusterfsd
1319 root 20 0 1269620 23780 2636 S 38.9 0.6 752:44.78
glusterfs
1334 root 20 0 2694264 56988 1672 S 16.3 1.5 311:20.90
ganesha.nfsd
27458 root 20 0 108984 1404 540 D 3.0 0.0 0:00.24 dd
14127 root 20 0 1164720 4860 1960 S 0.7 0.1 1:47.59
glusterfs
750 root 20 0 389864 5528 3988 S 0.3 0.1 0:08.77 sssd_be
--
Cheers,
Tom K.
-------------------------------------------------------------------------------------
Living on earth is expensive, but it includes a free trip around the sun.
Sam McLeod
2018-03-20 03:27:00 UTC
Permalink
Hi Raghavendra,
https://github.com/gluster/glusterfs/issues/364
Would like to know whether it helps for this usecase. Note that its not part of any release yet. So you've to build and install from repo.
Sounds interesting, not too keen to build packages at the moment but I've added myself as a watcher to that issue on Github and once it's in a 3.x release I'll try it and let you know.
Another suggestion is to run tests with turning off option performance.write-behind-trickling-writes.
# gluster volume set <volname> performance.write-behind-trickling-writes off
A word of caution though is if your files are too small, these suggestions may not have much impact.
I'm looking for documentation on this option but all I could really find is in the source for write-behind.c:

if [trickling-writes] is enabled (which it is), do not hold back writes if there are no outstanding requests.


and a note on aggregate-size stating that

"aggregation won't happen if performance.write-behind-trickling-writes is turned on"


What are the potentially negative performance impacts of disabling this?

--
Sam McLeod (protoporpoise on IRC)
https://smcleod.net
https://twitter.com/s_mcleod

Words are my own opinions and do not necessarily represent those of my employer or partners.
Raghavendra Gowdappa
2018-03-20 03:56:38 UTC
Permalink
Post by Sam McLeod
Hi Raghavendra,
Aggregating large number of small writes by write-behind into large writes
https://github.com/gluster/glusterfs/issues/364
Would like to know whether it helps for this usecase. Note that its not
part of any release yet. So you've to build and install from repo.
Sounds interesting, not too keen to build packages at the moment but I've
added myself as a watcher to that issue on Github and once it's in a 3.x
release I'll try it and let you know.
Another suggestion is to run tests with turning off option
performance.write-behind-trickling-writes.
# gluster volume set <volname> performance.write-behind-trickling-writes off
A word of caution though is if your files are too small, these suggestions
may not have much impact.
I'm looking for documentation on this option but all I could really find
if is enabled (which it is), do not hold back writes if there are no outstanding requests.
Until recently this functionality, though available, couldn't be
configured from the CLI; one could only change this option by editing the
volume configuration file. However, it's now configurable through the CLI:

https://review.gluster.org/#/c/18719/
Post by Sam McLeod
and a note on aggregate-size stating that
*"aggregation won't happen if performance.write-behind-trickling-writes is
turned on"*
What are the potentially negative performance impacts of disabling this?
Even if the aggregation option is turned off, write-behind has the capacity
to aggregate up to a size of 128KB. But to make full use of this for
small-write workloads, write-behind has to wait for some time so that
enough write requests accumulate to fill that capacity. With this option
enabled, write-behind still aggregates existing requests, but won't wait
for future writes. This means descendant xlators of write-behind can see
writes smaller than 128K. So, for a scenario where a small number of
large writes is preferred over a large number of small writes, this can
be a problem.
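
To make the 128KB figure concrete: 32 sequential 4k writes fill exactly
one aggregation window. A local-disk sketch (illustrative only; nothing
Gluster-specific in it):

```shell
# 32 x 4 KB sequential appends total 128 KB -- the most write-behind
# can aggregate into a single request for its descendant xlators.
f=./wb_agg_demo.bin
rm -f "$f"
i=0
while [ $i -lt 32 ]; do
    dd if=/dev/zero bs=4k count=1 2>/dev/null >> "$f"
    i=$((i + 1))
done
stat -c %s "$f"    # 131072 bytes = 128 KB
```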
Post by Sam McLeod
--
Sam McLeod (protoporpoise on IRC)
https://smcleod.net
https://twitter.com/s_mcleod
Words are my own opinions and do not necessarily represent those of
my employer or partners.
Sam McLeod
2018-03-20 04:15:28 UTC
Permalink
Excellent description, thank you.

With performance.write-behind-trickling-writes ON (default):

## 4k randwrite

# fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=32 --size=256MB --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=17.3MiB/s][r=0,w=4422 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=42701: Tue Mar 20 15:05:23 2018
write: IOPS=4443, BW=17.4MiB/s (18.2MB/s)(256MiB/14748msec)
bw ( KiB/s): min=16384, max=19184, per=99.92%, avg=17760.45, stdev=602.48, samples=29
iops : min= 4096, max= 4796, avg=4440.07, stdev=150.66, samples=29
cpu : usr=4.00%, sys=18.02%, ctx=131097, majf=0, minf=7
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwt: total=0,65536,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=17.4MiB/s (18.2MB/s), 17.4MiB/s-17.4MiB/s (18.2MB/s-18.2MB/s), io=256MiB (268MB), run=14748-14748msec


## 2k randwrite

# fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=test --bs=2k --iodepth=32 --size=256MB --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 2048B-2048B, (W) 2048B-2048B, (T) 2048B-2048B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=8624KiB/s][r=0,w=4312 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=42781: Tue Mar 20 15:05:57 2018
write: IOPS=4439, BW=8880KiB/s (9093kB/s)(256MiB/29522msec)
bw ( KiB/s): min= 6908, max= 9564, per=99.94%, avg=8874.03, stdev=428.92, samples=59
iops : min= 3454, max= 4782, avg=4437.00, stdev=214.44, samples=59
cpu : usr=2.43%, sys=18.18%, ctx=262222, majf=0, minf=8
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwt: total=0,131072,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=8880KiB/s (9093kB/s), 8880KiB/s-8880KiB/s (9093kB/s-9093kB/s), io=256MiB (268MB), run=29522-29522msec


With performance.write-behind-trickling-writes OFF:

## 4k randwrite - just over half the IOP/s of having it ON.


# fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=32 --size=256MB --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [f(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=44225: Tue Mar 20 15:11:04 2018
write: IOPS=2594, BW=10.1MiB/s (10.6MB/s)(256MiB/25259msec)
bw ( KiB/s): min= 2248, max=18728, per=100.00%, avg=10454.10, stdev=6481.14, samples=50
iops : min= 562, max= 4682, avg=2613.50, stdev=1620.35, samples=50
cpu : usr=2.29%, sys=10.09%, ctx=131141, majf=0, minf=7
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwt: total=0,65536,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=10.1MiB/s (10.6MB/s), 10.1MiB/s-10.1MiB/s (10.6MB/s-10.6MB/s), io=256MiB (268MB), run=25259-25259msec


## 2k randwrite - no noticeable change.

# fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=test --bs=2k --iodepth=32 --size=256MB --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 2048B-2048B, (W) 2048B-2048B, (T) 2048B-2048B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=8662KiB/s][r=0,w=4331 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=45813: Tue Mar 20 15:12:02 2018
write: IOPS=4291, BW=8583KiB/s (8789kB/s)(256MiB/30541msec)
bw ( KiB/s): min= 7416, max=10264, per=99.94%, avg=8577.66, stdev=618.31, samples=61
iops : min= 3708, max= 5132, avg=4288.84, stdev=309.15, samples=61
cpu : usr=2.87%, sys=15.83%, ctx=262236, majf=0, minf=8
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwt: total=0,131072,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=8583KiB/s (8789kB/s), 8583KiB/s-8583KiB/s (8789kB/s-8789kB/s), io=256MiB (268MB), run=30541-30541msec


Let me know if you'd recommend any other benchmarks comparing performance.write-behind-trickling-writes ON/OFF (just nothing that'll seriously risk locking up the whole gluster cluster please!).


--
Sam McLeod
Please respond via email when possible.
https://smcleod.net
https://twitter.com/s_mcleod
Post by Sam McLeod
Hi Raghavendra,
https://github.com/gluster/glusterfs/issues/364
Would like to know whether it helps for this usecase. Note that its not part of any release yet. So you've to build and install from repo.
Sounds interesting, not too keen to build packages at the moment but I've added myself as a watcher to that issue on Github and once it's in a 3.x release I'll try it and let you know.
Another suggestion is to run tests with turning off option performance.write-behind-trickling-writes.
# gluster volume set <volname> performance.write-behind-trickling-writes off
A word of caution though is if your files are too small, these suggestions may not have much impact.
if is enabled (which it is), do not hold back writes if there are no outstanding requests.
https://review.gluster.org/#/c/18719/
and a note on aggregate-size stating that
"aggregation won't happen if performance.write-behind-trickling-writes is turned on"
What are the potentially negative performance impacts of disabling this?
Even if aggregation option is turned off, write-behind has the capacity to aggregate till a size of 128KB. But, to completely make use of this in case of small write workloads write-behind has to wait for sometime so that there are enough number of write-requests to fill the capacity. With this option enabled, write-behind though aggregates existing requests, won't wait for future writes. This means descendant xlators of write-behind can see writes smaller than 128K. So, for a scenario where small number of large writes are preferred over large number of small sized writes, this can be a problem.
--
Sam McLeod (protoporpoise on IRC)
https://smcleod.net
https://twitter.com/s_mcleod
Words are my own opinions and do not necessarily represent those of my employer or partners.
Raghavendra Gowdappa
2018-03-20 07:23:22 UTC
Permalink
Post by Sam McLeod
Excellent description, thank you.
## 4k randwrite
# fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test
--filename=test --bs=4k --iodepth=32 --size=256MB --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=17.3MiB/s][r=0,w=4422 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=42701: Tue Mar 20 15:05:23 2018
write: *IOPS=4443*, *BW=17.4MiB/s* (18.2MB/s)(256MiB/14748msec)
bw ( KiB/s): min=16384, max=19184, per=99.92%, avg=17760.45, stdev=602.48, samples=29
iops : min= 4096, max= 4796, avg=4440.07, stdev=150.66, samples=29
cpu : usr=4.00%, sys=18.02%, ctx=131097, majf=0, minf=7
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
Post by Sam McLeod
=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
Post by Sam McLeod
=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
Post by Sam McLeod
=64=0.0%
issued rwt: total=0,65536,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
WRITE: bw=17.4MiB/s (18.2MB/s), 17.4MiB/s-17.4MiB/s (18.2MB/s-18.2MB/s),
io=256MiB (268MB), run=14748-14748msec
## 2k randwrite
# fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test
--filename=test --bs=2k --iodepth=32 --size=256MB --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 2048B-2048B, (W) 2048B-2048B, (T)
2048B-2048B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=8624KiB/s][r=0,w=4312 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=42781: Tue Mar 20 15:05:57 2018
write: *IOPS=4439, BW=8880KiB/s* (9093kB/s)(256MiB/29522msec)
bw ( KiB/s): min= 6908, max= 9564, per=99.94%, avg=8874.03, stdev=428.92, samples=59
iops : min= 3454, max= 4782, avg=4437.00, stdev=214.44, samples=59
cpu : usr=2.43%, sys=18.18%, ctx=262222, majf=0, minf=8
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
Post by Sam McLeod
=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
Post by Sam McLeod
=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
Post by Sam McLeod
=64=0.0%
issued rwt: total=0,131072,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
WRITE: bw=8880KiB/s (9093kB/s), 8880KiB/s-8880KiB/s (9093kB/s-9093kB/s),
io=256MiB (268MB), run=29522-29522msec
## 4k randwrite - just over half the IOP/s of having it ON.
Note that since the workload is random write, no aggregation is possible.
So there is no point in waiting for future writes, and turning
trickling-writes on makes sense.

A better test to measure the impact of this option would be a sequential
write workload. I'd guess the smaller the writes, the more pronounced the
benefit of turning this option off.
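
Along those lines, the earlier fio runs could be repeated with a
sequential pattern. A sketch of such a job file (assumed parameters
mirroring the command-line runs in this thread, with rw=write instead of
randwrite; the filename is a placeholder):

```ini
; seq-small-write.fio -- sketch only, mirroring the earlier runs but
; sequential, so adjacent 4k writes are candidates for aggregation
[seqtest]
ioengine=libaio
gtod_reduce=1
randrepeat=1
bs=4k
iodepth=32
size=256MB
rw=write
filename=seqtest
```

Run with `fio seq-small-write.fio` on the Gluster mount, once with
performance.write-behind-trickling-writes on and once with it off.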
Post by Sam McLeod
# fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test
--filename=test --bs=4k --iodepth=32 --size=256MB --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [f(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=44225: Tue Mar 20 15:11:04 2018
write: *IOPS=2594, BW=10.1MiB/s* (10.6MB/s)(256MiB/25259msec)
bw ( KiB/s): min= 2248, max=18728, per=100.00%, avg=10454.10, stdev=6481.14, samples=50
iops : min= 562, max= 4682, avg=2613.50, stdev=1620.35, samples=50
cpu : usr=2.29%, sys=10.09%, ctx=131141, majf=0, minf=7
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
Post by Sam McLeod
=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
Post by Sam McLeod
=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
Post by Sam McLeod
=64=0.0%
issued rwt: total=0,65536,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
WRITE: bw=10.1MiB/s (10.6MB/s), 10.1MiB/s-10.1MiB/s (10.6MB/s-10.6MB/s),
io=256MiB (268MB), run=25259-25259msec
## 2k randwrite - no noticeable change.
# fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test
--filename=test --bs=2k --iodepth=32 --size=256MB --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 2048B-2048B, (W) 2048B-2048B, (T)
2048B-2048B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=8662KiB/s][r=0,w=4331 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=45813: Tue Mar 20 15:12:02 2018
write: *IOPS=4291, BW=8583KiB/s* (8789kB/s)(256MiB/30541msec)
bw ( KiB/s): min= 7416, max=10264, per=99.94%, avg=8577.66, stdev=618.31, samples=61
iops : min= 3708, max= 5132, avg=4288.84, stdev=309.15, samples=61
cpu : usr=2.87%, sys=15.83%, ctx=262236, majf=0, minf=8
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
Post by Sam McLeod
=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
Post by Sam McLeod
=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
Post by Sam McLeod
=64=0.0%
issued rwt: total=0,131072,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
WRITE: bw=8583KiB/s (8789kB/s), 8583KiB/s-8583KiB/s (8789kB/s-8789kB/s),
io=256MiB (268MB), run=30541-30541msec
Let me know if you'd recommend any other benchmarks
comparing performance.write-behind-trickling-writes ON/OFF (just nothing
that'll seriously risk locking up the whole gluster cluster please!).
--
Sam McLeod
Please respond via email when possible.
https://smcleod.net
https://twitter.com/s_mcleod
Post by Sam McLeod
Hi Raghavendra,
Aggregating large number of small writes by write-behind into large
https://github.com/gluster/glusterfs/issues/364
Would like to know whether it helps for this usecase. Note that its not
part of any release yet. So you've to build and install from repo.
Sounds interesting, not too keen to build packages at the moment but I've
added myself as a watcher to that issue on Github and once it's in a 3.x
release I'll try it and let you know.
Another suggestion is to run tests with turning off option
performance.write-behind-trickling-writes.
# gluster volume set <volname> performance.write-behind-trickling-writes off
A word of caution though is if your files are too small, these
suggestions may not have much impact.
I'm looking for documentation on this option but all I could really find
if is enabled (which it is), do not hold back writes if there are no
outstanding requests.
Till recently this functionality though was available, couldn't be
configured from cli. One could change this option by editing volume
https://review.gluster.org/#/c/18719/
Post by Sam McLeod
and a note on aggregate-size stating that
*"aggregation won't happen if performance.write-behind-trickling-writes
is turned on"*
What are the potentially negative performance impacts of disabling this?
Even if aggregation option is turned off, write-behind has the capacity to
aggregate till a size of 128KB. But, to completely make use of this in case
of small write workloads write-behind has to wait for sometime so that
there are enough number of write-requests to fill the capacity. With this
option enabled, write-behind though aggregates existing requests, won't
wait for future writes. This means descendant xlators of write-behind can
see writes smaller than 128K. So, for a scenario where small number of
large writes are preferred over large number of small sized writes, this
can be a problem.
Post by Sam McLeod
--
Sam McLeod (protoporpoise on IRC)
https://smcleod.net
https://twitter.com/s_mcleod
Words are my own opinions and do not necessarily represent those of
my employer or partners.
Sam McLeod
2018-03-20 00:06:53 UTC
Permalink
Howdy all,

Sorry, I'm in Australia so most of your replies came in overnight for me.

Note: At the end of this reply is a listing of all our volume settings (gluster get volname all).
Note 2: I really wish Gluster used Discourse for this kind of community troubleshooting and analysis; using a mailing list is really painful.
performance.cache-refresh-timeout Default: 1s
I've actually set this right up to 60 (seconds), I guess it's possible that's causing an issue but I thought that was more for forced eviction on idle files.
cluster.stripe-block-size Default: 128KB
Hmm yes I wonder if it might be worth looking at the stripe-block-size, I forgot about this as it sounds like it's for striped volumes (now deprecated) only.
The issue with this is that I don't want to tune the volume just for small files and hurt the performance of larger I/O operations.
http://lists.gluster.org/pipermail/gluster-users/2015-April/021487.html
GlusterFS 3.7 is really old so I'd be careful looking at settings / tuning for it.
nfs.trusted-sync: on
Not using NFS.
performance.cache-size: 1GB
Already set to 1024MB, but that's only for reads not writes.
performance.io-thread-count: 16
That's my current setting.
performance.write-behind-window-size: 8MB
Currently allowing even more, with the window size at 256MB.
performance.readdir-ahead: on
That's my current setting (the default now I believe).
client.event-threads: 8
That's my current setting (the default now I believe).
server.event-threads: 8
That's my current setting (the default now I believe).
cluster.quorum-type: auto
Not sure how that's going to impact small I/O performance.
I currently have this set to none, but do use an arbiter node.
cluster.server-quorum-type: server
Not sure how that's going to impact small I/O performance.
I currently have this set to off, but do use an arbiter node.
cluster.server-quorum-ratio: 51%
Not sure how that's going to impact small I/O performance.
I currently have this set to 0, but do use an arbiter node.
net.ipv4.tcp_slow_start_after_idle = 0
That's my current setting.
net.ipv4.tcp_fin_timeout = 15
I've set this right down to 5.
net.core.somaxconn = 65535
That's my current setting.
vm.swappiness = 1
That's my current setting, we don't have swap - other than ZRAM enabled on any hosts.
vm.dirty_ratio = 5
N/A as swap disabled (ZRAM only)
vm.dirty_background_ratio = 2
N/A as swap disabled (ZRAM only)
vm.min_free_kbytes = 524288 # this is on 128GB RAM
I have this set to vm.min_free_kbytes = 67584, I'd be worried that setting this high would cause OOM as per the official kernel docs:

min_free_kbytes:

This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.
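
For reference, the kernel settings discussed above could be collected
into one sysctl fragment (values are the ones quoted in this thread, not
recommendations, and the file path is illustrative):

```ini
# /etc/sysctl.d/99-gluster-tuning.conf -- illustrative collection of the
# values mentioned in this thread; not a recommendation
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_fin_timeout = 5
net.core.somaxconn = 65535
vm.swappiness = 1
vm.min_free_kbytes = 67584
```

Apply with `sysctl --system` (or `sysctl -p <file>`) and verify with
`sysctl net.core.somaxconn` etc.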
That's not really a fair comparison as you don't specify a blocksize.
What does
dd if=/dev/zero of=./some-file.bin bs=1M count=1000 oflag=direct
give?
Rik
DD is not going to give anyone particularly useful benchmarks, especially with small file sizes, in fact it's more likely to mislead you than be useful.
See my short post on fio here: https://smcleod.net/tech/2016/04/29/benchmarking-io.html ; I believe it's one of the most useful tools for I/O benchmarking.

Just for a laugh I compared dd writes for 4k (small) writes between the client (gluster mounted on the cli) and a gluster host (to a directory on the same storage as the bricks).
The client came out faster, likely the direct I/O flag was not working as perhaps intended.

Client:

# dd if=/dev/zero of=./some-file.bin bs=4K count=4096 oflag=direct
4096+0 records in
4096+0 records out
16777216 bytes (17 MB) copied, 2.27839 s, 7.4 MB/s

Server:

dd if=/dev/zero of=./some-file.bin bs=4K count=4096 oflag=direct
4096+0 records in
4096+0 records out
16777216 bytes (17 MB) copied, 3.94093 s, 4.3 MB/s
Here is an output of all gluster volume settings as they currently stand:


# gluster volume get uat_storage all
Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize true
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize true
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock true
disperse.eager-lock on
disperse.other-eager-lock on
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy size
cluster.full-lock yes
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level ERROR
diagnostics.client-log-level ERROR
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 6MB
performance.cache-min-file-size 0
performance.cache-refresh-timeout 60
performance.cache-priority
performance.cache-size 1024MB
performance.io-thread-count 16
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.cache-size 1024MB
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 256MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes on
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open no
performance.read-ahead-page-count 4
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 15
network.tcp-window-size (null)
features.lock-heal off
features.grace-timeout 10
network.remote-dio disable
client.event-threads 8
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 50000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure (null)
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 256
features.lock-heal off
features.grace-timeout 10
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 8
server.tcp-user-timeout 0
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 2048
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.nl-cache off
performance.stat-prefetch true
performance.client-io-threads true
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache off
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation true
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.limit-usage (null)
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.disable on
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.bd-aio off
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir (null)
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-promote-file-size 0
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
cluster.tier-query-limit 100
cluster.tier-compact on
cluster.tier-hot-compact-frequency 604800
cluster.tier-cold-compact-frequency 604800
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 12500
features.ctr-sql-db-wal-autocheckpoint 25000
features.selinux on
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation true
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy round-robin
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops true
performance.parallel-readdir off
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 256MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex off
cluster.max-bricks-per-process 0
disperse.optimistic-change-log on
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable (null)
disperse.parallel-writes on
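In case it helps frame suggestions: the usual small-file tunings touch options that appear in the listing above (negative-lookup caching, parallel readdir, inode cache size). A sketch only, not tested recommendations; the option names are taken from the `volume get` output, but the values are guesses to be validated for your Gluster version:

```shell
# Sketch: candidate small-file tunings, values illustrative only.

# Cache negative lookups (ENOENT), which small-file create/copy
# workloads generate heavily (currently off above)
gluster volume set uat_storage performance.nl-cache on

# Parallelise readdir across bricks (currently off above)
gluster volume set uat_storage performance.parallel-readdir on

# Keep more inodes cached (currently 50000 above)
gluster volume set uat_storage network.inode-lru-limit 200000
```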


--
Sam McLeod (protoporpoise on IRC)
https://smcleod.net
https://twitter.com/s_mcleod

Words are my own opinions and do not necessarily represent those of my employer or partners.