Discussion:
[Gluster-users] Gluster 3.12.12: performance during heal and in general
Hu Bert
2018-07-19 06:31:22 UTC
Hi there,

I sent this mail yesterday, but somehow it didn't work? It wasn't archived,
so please be indulgent if you receive this mail again :-)

We are currently running a replicate setup and are experiencing quite
poor performance. It got even worse when 2 bricks (disks) crashed
within a couple of weeks. Some general information about our setup:

3 Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB DDR4, OS on
separate disks); each server has 4 10TB disks -> each is a brick;
replica 3 setup (see gluster volume info below). Debian stretch,
kernel 4.9.0, gluster version 3.12.12. Servers and clients are
connected via 10 GBit ethernet.

About a month ago and again 2 days ago a disk died (on different servers);
the disks were replaced, brought back into the volume, and a full self
heal was started. But the speed of this is quite... disappointing. Each
brick has ~1.6TB of data on it (mostly the infamous small files). The
full heal I started yesterday copied only ~50GB within 24 hours (48
hours: about 100GB); at this rate it would take weeks for the self heal
to finish.
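For reference, a full heal like this is typically triggered and monitored
with the standard heal commands (shown here for our volume name 'shared'):

gluster volume heal shared full                     # trigger a full heal
gluster volume heal shared info                     # list entries still pending heal
gluster volume heal shared statistics heal-count    # per-brick count of pending entries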

After the first heal (started on gluster13 about a month ago, it took
about 3 weeks) finished, we had terrible performance; CPU on one or
two of the nodes (gluster11, gluster12) was up to 1200%, consumed by
the brick process of the formerly crashed brick (bricksdd1),
interestingly not on the server with the failed disk, but on the other
2...

Well... am I doing something wrong? Some options wrongly configured?
Terrible setup? Anyone got an idea? Any additional information needed?


Thx in advance :-)

gluster volume info

Volume Name: shared
Type: Distributed-Replicate
Volume ID: e879d208-1d8c-4089-85f3-ef1b3aa45d36
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: gluster11:/gluster/bricksda1/shared
Brick2: gluster12:/gluster/bricksda1/shared
Brick3: gluster13:/gluster/bricksda1/shared
Brick4: gluster11:/gluster/bricksdb1/shared
Brick5: gluster12:/gluster/bricksdb1/shared
Brick6: gluster13:/gluster/bricksdb1/shared
Brick7: gluster11:/gluster/bricksdc1/shared
Brick8: gluster12:/gluster/bricksdc1/shared
Brick9: gluster13:/gluster/bricksdc1/shared
Brick10: gluster11:/gluster/bricksdd1/shared
Brick11: gluster12:/gluster/bricksdd1_new/shared
Brick12: gluster13:/gluster/bricksdd1_new/shared
Options Reconfigured:
cluster.shd-max-threads: 4
performance.md-cache-timeout: 60
cluster.lookup-optimize: on
cluster.readdir-optimize: on
performance.cache-refresh-timeout: 4
performance.parallel-readdir: on
server.event-threads: 8
client.event-threads: 8
performance.cache-max-file-size: 128MB
performance.write-behind-window-size: 16MB
performance.io-thread-count: 64
cluster.min-free-disk: 1%
performance.cache-size: 24GB
nfs.disable: on
transport.address-family: inet
performance.high-prio-threads: 32
performance.normal-prio-threads: 32
performance.low-prio-threads: 32
performance.least-prio-threads: 8
performance.io-cache: on
server.allow-insecure: on
performance.strict-o-direct: off
transport.listen-backlog: 100
server.outstanding-rpc-limit: 128
Hu Bert
2018-07-20 07:41:35 UTC
Hmm... no one has any idea?

Additional question: the hdd on server gluster12 was replaced; so far
~220 GB have been copied. On the other 2 servers I see a lot of entries
in glustershd.log, about 312,000 and 336,000 entries respectively
yesterday; most of them (current log output) look like this:

[2018-07-20 07:30:49.757595] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
Completed data selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
sources=0 [2] sinks=1
[2018-07-20 07:30:49.992398] I [MSGID: 108026]
[afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
0-shared-replicate-3: performing metadata selfheal on
0d863a62-0dd8-401c-b699-2b642d9fd2b6
[2018-07-20 07:30:50.243551] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
Completed metadata selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
sources=0 [2] sinks=1

or like this:

[2018-07-20 07:38:41.726943] I [MSGID: 108026]
[afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
0-shared-replicate-3: performing metadata selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba
[2018-07-20 07:38:41.855737] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
Completed metadata selfheal on 9276097a-cdac-4d12-9dc6-04b1ea4458ba.
sources=[0] 2 sinks=1
[2018-07-20 07:38:44.755800] I [MSGID: 108026]
[afr-self-heal-entry.c:887:afr_selfheal_entry_do]
0-shared-replicate-3: performing entry selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba

Is this behaviour normal? I'd expect these messages on the server with
the failed brick, not on the other ones.
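As a rough check, I simply counted the entries in the log on each node
(default log location assumed):

grep -c 'Completed data selfheal' /var/log/glusterfs/glustershd.log       # finished data heals
grep -c 'Completed metadata selfheal' /var/log/glusterfs/glustershd.log   # finished metadata heals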
Hu Bert
2018-07-23 10:46:07 UTC
Well, over the weekend about 200GB were copied, so now ~400GB have been
copied to the new brick. That's nowhere near a speed of 10GB per hour.
If I copied the 1.6 TB directly, that would be done within 2 days at
most. But with the self heal this will take at least 20 days.

Why is the performance that bad? No chance of speeding this up?
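Rough arithmetic behind that estimate (assuming the rate stays roughly constant):

~400 GB copied in ~5 days           -> ~80 GB per day
remaining ~1.2 TB / 80 GB per day   -> ~15 more days, i.e. ~20 days in total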
Pranith Kumar Karampuri
2018-07-24 08:40:56 UTC
What kind of data do you have?
How many directories in the filesystem?
On average how many files per directory?
What is the depth of your directory hierarchy on average?
What is average filesize?

Based on this data we can see if anything can be improved, or if there
are some enhancements that need to be implemented in gluster to address
this kind of data layout.
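Something like the following, run against a client mount, should give
rough numbers (an illustrative sketch; <mnt> is a placeholder):

find <mnt> -mindepth 1 -type d | wc -l      # number of directories
find <mnt> -type f | wc -l                  # number of files
find <mnt> -type f -printf '%s\n' | awk '{ sum += $1; n++ } END { if (n) print sum/n }'   # average file size in bytes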
--
Pranith
Hu Bert
2018-07-26 05:10:09 UTC
Hi Pranith,

Sorry, it took a while to count the directories. I'll try to answer your
questions as well as possible.
Post by Pranith Kumar Karampuri
What kind of data do you have?
How many directories in the filesystem?
On average how many files per directory?
What is the depth of your directory hierarchy on average?
What is average filesize?
We have mostly images (more than 95% of disk usage, 90% of file
count), some text files (like css, jsp, gpx etc.) and some binaries.

There are about 190,000 directories in the file system; maybe there
are some more, because we're hit by bug 1512371 (parallel-readdir =
TRUE prevents directory listing). But the number of directories
could/will rise in the future (maybe to millions).

Files per directory: ranges from 0 to 100; on average it should be
about 20 files per directory (well, at least in the deepest dirs, see
explanation below).

Average file size: ranges from a few hundred bytes up to 30 MB; on
average it should be 2-3 MB.

Directory hierarchy: maximum depth as seen from within the volume is
6, the average should be 3.

volume name: shared
mount point on clients: /data/repository/shared/
below /shared/ there are 2 directories:
- public/: mainly calculated images (file sizes from a few KB up to
max 1 MB) and some resources (small PNGs with a size of a few hundred
bytes).
- private/: mainly source images; file sizes from 50 KB up to 30 MB

We migrated from an NFS server (SPOF) to glusterfs and simply copied
our files over. The images (which have an ID) are stored in the deepest
directories of the dir tree. Let me explain it a bit better :-)

Directory structure for the images (I'll omit some other miscellaneous
stuff, but it looks quite similar):
- ID of an image has 7 or 8 digits
- /shared/private/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID.jpg
- /shared/public/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID/$misc_formats.jpg
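A concrete example with a made-up image ID 1234567:

- /shared/private/123/456/1234567.jpg
- /shared/public/123/456/1234567/$misc_formats.jpg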

That's why we have so many (sub-)directories. Files are only stored
at the lowest level of the directory hierarchy. I hope that makes our
structure at least a bit more transparent.

I hope there's something we can do to raise performance a bit. Thx in
advance :-)
Pranith Kumar Karampuri
2018-07-26 06:56:22 UTC
Thanks a lot for the detailed write-up; this helps find the bottlenecks easily.
On a high level, to handle this directory hierarchy (i.e. lots of
directories with files) we need to improve the healing algorithms.
Based on the data you provided, we need to make the following
enhancements:

1) At the moment directories are healed one at a time, but files can be
healed up to 64 in parallel per replica subvolume.
So if you have nX2 or nX3 distributed subvolumes, it can heal 64n files
in parallel (with your 4 x 3 volume that would be up to 256 files in parallel).

I raised https://github.com/gluster/glusterfs/issues/477 to track this. In
the meanwhile you can use the following workaround:
a) Increase background heals on the mount:
gluster volume set <volname> cluster.background-self-heal-count 256
gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000
find <mnt> -type d | xargs stat

One 'find' will trigger heals for 10256 directories (256 + 10000), so
you may have to do this periodically until all directories are healed.
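A minimal sketch for re-running the crawl periodically (volume name and
mount path are placeholders; stop it once 'heal info' reports zero
pending entries on all bricks):

while true; do
    find <mnt> -type d | xargs stat > /dev/null                      # re-trigger heals for all directories
    gluster volume heal <volname> info | grep 'Number of entries:'   # show pending heal counts per brick
    sleep 3600
done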

2) Self-heal heals a file 128KB at a time (data-self-heal-window-size). I
think for your environment bumping it up to MBs is better, say 2MB, i.e.
16*128KB.

The command to do that is:
gluster volume set <volname> cluster.data-self-heal-window-size 16
--
Pranith
Hu Bert
2018-07-26 07:29:56 UTC
Hi Pranith,

Thanks a lot for your efforts and for tracking "my" problem with an issue. :-)

I've set these params on the gluster volume and will start the
'find...' command shortly. I'll probably send another mail to the
list to document the progress.

Btw. - you had some typos:
gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000 => "cluster" is doubled
gluster volume set <volname> cluster.data-self-heal-window-size 16 => it's actually cluster.self-heal-window-size

but actually no problem :-)

Just curious: would gluster 4.1 improve the performance for healing
and in general for "my" scenario?
Pranith Kumar Karampuri
2018-07-26 08:17:16 UTC
Post by Hu Bert
Hi Pranith,
thanks a lot for your efforts and for tracking "my" problem with an issue. :-)
I've set this params on the gluster volume and will start the
'find...' command within a short time. I'll probably add another
answer to the list to document the progress.
gluster volume set <volname> cluster.cluster.heal-wait-queue-length
10000 => cluster is doubled
gluster volume set <volname> cluster.data-self-heal-window-size 16 =>
it's actually cluster.self-heal-window-size
but actually no problem :-)
Sorry, bad copy/paste :-(.
Post by Hu Bert
Just curious: would gluster 4.1 improve the performance for healing
and in general for "my" scenario?
No, this issue is present in all existing releases, but it is solvable.
You can follow that issue to see the progress and when it is fixed.
--
Pranith
Hu Bert
2018-07-26 09:11:28 UTC
Post by Pranith Kumar Karampuri
Sorry, bad copy/paste :-(.
np :-)

The question regarding version 4.1 was meant more generally: does
gluster 4.x in general perform better than the 3.12 series?
Just curious :-) Sooner or later we'll have to upgrade anyway.

Btw.: gluster12 was the node with the failed brick, and I started the
full heal on this node (it has the biggest UUID as well). Is it normal
that the glustershd.log on this node is rather empty (a few hundred
entries), while the glustershd.log files on the 2 other nodes have
hundreds of thousands of entries?

(Sorry for the duplicate mail; the first one didn't go to the list, but
maybe others are interested... :-) )
Post by Pranith Kumar Karampuri
Post by Hu Bert
Hi Pranith,
thanks a lot for your efforts and for tracking "my" problem with an issue. :-)
I've set this params on the gluster volume and will start the
'find...' command within a short time. I'll probably add another
answer to the list to document the progress.
gluster volume set <volname> cluster.cluster.heal-wait-queue-length
10000 => cluster is doubled
gluster volume set <volname> cluster.data-self-heal-window-size 16 =>
it's actually cluster.self-heal-window-size
but actually no problem :-)
Sorry, bad copy/paste :-(.
Post by Hu Bert
Just curious: would gluster 4.1 improve the performance for healing
and in general for "my" scenario?
No, this issue is present in all the existing releases. But it is solvable.
You can follow that issue to see progress and when it is fixed etc.
Post by Hu Bert
Post by Pranith Kumar Karampuri
Thanks a lot for the detailed write-up, this helps find the bottlenecks easily.
On a high level, to handle this directory hierarchy, i.e. lots of directories
with files, we need to improve the healing algorithms. Based on the data you
provided, we need to make the following changes:
1) At the moment directories are healed one at a time, but files can be
healed up to 64 in parallel per replica subvolume.
So if you have nX2 or nX3 distributed subvolumes, it can heal 64n files
in parallel.
I raised https://github.com/gluster/glusterfs/issues/477 to track this. In
the meantime:
gluster volume set <volname> cluster.background-self-heal-count 256
gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000
find <mnt> -type d | xargs stat
One 'find' will trigger heals on 10256 directories. So you may have to do this
periodically until all directories are healed.
2) Self-heal heals a file 128KB at a time (data-self-heal-window-size). I
think for your environment bumping it up to MBs is better. Say 2MB, i.e.
16*128KB?
gluster volume set <volname> cluster.data-self-heal-window-size 16
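Concretely, that could look something like this (a rough sketch only; the
volume name 'shared' and the client mount /data/repository/shared are taken
from this setup, and the option names are used as corrected above):

gluster volume set shared cluster.background-self-heal-count 256
gluster volume set shared cluster.heal-wait-queue-length 10000
gluster volume set shared cluster.self-heal-window-size 16

# crawl all directories from a client mount; repeat this periodically
# until all directories are healed
find /data/repository/shared -type d | xargs -r stat > /dev/null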
Post by Hu Bert
Hi Pranith,
Sorry, it took a while to count the directories. I'll try to answer your
questions as well as possible.
Post by Pranith Kumar Karampuri
What kind of data do you have?
How many directories in the filesystem?
On average how many files per directory?
What is the depth of your directory hierarchy on average?
What is average filesize?
We have mostly images (more than 95% of disk usage, 90% of file
count), some text files (like css, jsp, gpx etc.) and some binaries.
There are about 190.000 directories in the file system; maybe there
are some more because we're hit by bug 1512371 (parallel-readdir =
TRUE prevents directory listing). But the number of directories
could/will rise in the future (maybe millions).
Files per directory: ranges from 0 to 100, on average it should be 20
files per directory (well, at least in the deepest dirs, see
explanation below).
Average filesize: ranges from a few hundred bytes up to 30 MB, on
average it should be 2-3 MB.
Directory hierarchy: maximum depth as seen from within the volume is
6, the average should be 3.
volume name: shared
mount point on clients: /data/repository/shared/
- public/: mainly calculated images (file sizes from a few KB up to
max 1 MB) and some resources (small PNGs with a size of a few hundred
bytes).
- private/: mainly source images; file sizes from 50 KB up to 30 MB
We migrated from an NFS server (SPOF) to glusterfs and simply copied
our files. The images (which have an ID) are stored in the deepest
directories of the dir tree. I'll better explain it :-)
directory structure for the images (I'll omit some other miscellaneous
directories):
- ID of an image has 7 or 8 digits
- /shared/private/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID.jpg
- /shared/public/: /(first 3 digits of ID)/(next 3 digits of
ID)/$ID/$misc_formats.jpg
That's why we have that many (sub-)directories. Files are only stored
in the lowest directory hierarchy. I hope I could make our structure
at least a bit more transparent.
i hope there's something we can do to raise performance a bit. thx in
advance :-)
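Just to illustrate that layout, a hypothetical little helper (the function
name and the example ID are made up, the paths follow the scheme described
above) would map an image ID to its two locations like this:

# print the private and public locations for a given image ID (7 or 8 digits)
path_for_id() {
    local id=$1
    local a=${id:0:3}   # first 3 digits of the ID
    local b=${id:3:3}   # next 3 digits of the ID
    echo "/shared/private/$a/$b/$id.jpg"
    echo "/shared/public/$a/$b/$id/"    # contains the various derived formats
}

path_for_id 12345678
# -> /shared/private/123/456/12345678.jpg
# -> /shared/public/123/456/12345678/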
2018-07-24 10:40 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Well, over the weekend about 200GB were copied, so now there are
~400GB copied to the brick. That's far beyond a speed of 10GB per
hour. If I copied the 1.6 TB directly, that would be done within max 2
days. But with the self heal this will take at least 20 days minimum.
Why is the performance that bad? No chance of speeding this up?
What kind of data do you have?
How many directories in the filesystem?
On average how many files per directory?
What is the depth of your directory hierarchy on average?
What is average filesize?
Based on this data we can see if anything can be improved. Or if there
are some enhancements that need to be implemented in gluster to address
this kind of data layout.
Post by Hu Bert
Post by Hu Bert
hmm... no one any idea?
Additional question: the hdd on server gluster12 was changed, so far
~220 GB were copied. On the other 2 servers I see a lot of entries in
glustershd.log, about 312.000 and 336.000 entries there yesterday,
most of them (current log output) looking like this:
[2018-07-20 07:30:49.757595] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal]
Completed data selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
sources=0 [2] sinks=1
[2018-07-20 07:30:49.992398] I [MSGID: 108026]
[afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
0-shared-replicate-3: performing metadata selfheal on
0d863a62-0dd8-401c-b699-2b642d9fd2b6
[2018-07-20 07:30:50.243551] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal]
Completed metadata selfheal on
0d863a62-0dd8-401c-b699-2b642d9fd2b6.
sources=0 [2] sinks=1
[2018-07-20 07:38:41.726943] I [MSGID: 108026]
[afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
0-shared-replicate-3: performing metadata selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba
[2018-07-20 07:38:41.855737] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal]
Completed metadata selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba.
sources=[0] 2 sinks=1
[2018-07-20 07:38:44.755800] I [MSGID: 108026]
[afr-self-heal-entry.c:887:afr_selfheal_entry_do]
0-shared-replicate-3: performing entry selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba
is this behaviour normal? I'd expect these messages on the server with
the failed brick, not on the other ones.
Pranith Kumar Karampuri
2018-07-26 09:29:10 UTC
Permalink
Post by Hu Bert
Post by Pranith Kumar Karampuri
Sorry, bad copy/paste :-(.
np :-)
The question regarding version 4.1 was meant more generally: does
gluster v4.0 etc. have a better performance than version 3.12 etc.?
Just curious :-) Sooner or later we have to upgrade anyway.
You can check what changed @
https://github.com/gluster/glusterfs/blob/release-4.0/doc/release-notes/4.0.0.md#performance
https://github.com/gluster/glusterfs/blob/release-4.1/doc/release-notes/4.1.0.md#performance
Post by Hu Bert
btw.: gluster12 was the node with the failed brick, and i started the
full heal on this node (has the biggest uuid as well). Is it normal
that the glustershd.log on this node is rather empty (some hundred
entries), but the glustershd.log files on the 2 other nodes have
hundreds of thousands of entries?
heals happen on the good bricks, so this is expected.
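For example, counting the completed self-heals in glustershd.log on each
node should show that difference directly (assuming the default log
location under /var/log/glusterfs):

# run on gluster11/12/13; the nodes acting as heal sources will show
# by far the highest counts
grep -c 'Completed .* selfheal on' /var/log/glusterfs/glustershd.log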
--
Pranith
Hu Bert
2018-07-27 05:41:53 UTC
Permalink
Good Morning :-)

on server gluster11 about 1.25 million and on gluster13 about 1.35
million log entries in the glustershd.log file. About 70 GB got healed,
overall ~700GB of 2.0TB. It doesn't seem to run faster. I'm calling
'find...' whenever I notice that it has finished. Hmm... is it
possible and reasonable to run 2 finds in parallel, maybe on different
subdirectories? E.g. running one on $volume/public/ and one on
$volume/private/ ?
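Something like this (an untested sketch, using the client mount path from
this setup) would run the two crawls in parallel:

find /data/repository/shared/public  -type d | xargs -r stat > /dev/null &
find /data/repository/shared/private -type d | xargs -r stat > /dev/null &
wait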
Pranith Kumar Karampuri
2018-07-27 05:55:54 UTC
Permalink
Post by Hu Bert
Good Morning :-)
on server gluster11 about 1.25 million and on gluster13 about 1.35
million log entries in the glustershd.log file. About 70 GB got healed,
overall ~700GB of 2.0TB. It doesn't seem to run faster. I'm calling
'find...' whenever I notice that it has finished. Hmm... is it
possible and reasonable to run 2 finds in parallel, maybe on different
subdirectories? E.g. running one on $volume/public/ and one on
$volume/private/ ?
Do you already have all the 190000 directories created? If not,
could you find out which of the paths need it and do a stat directly
instead of find?
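For example (a sketch only; the brick paths are taken from the volume info
above, and it assumes ssh access from a machine that has a client mount at
/data/repository/shared, plus GNU find), the directories that exist on the
healthy brick but not yet on the replaced one could be stat'ed directly:

# compare the directory lists of the healthy and the replaced brick,
# then stat the missing ones through the client mount to trigger their heal
comm -23 \
  <(ssh gluster11 "find /gluster/bricksdd1/shared -type d -not -path '*/.glusterfs*' -printf '%P\n'" | sort) \
  <(ssh gluster12 "find /gluster/bricksdd1_new/shared -type d -not -path '*/.glusterfs*' -printf '%P\n'" | sort) |
while read -r dir; do
    stat "/data/repository/shared/$dir" > /dev/null
done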
--
Pranith
Hu Bert
2018-07-27 06:23:25 UTC
Permalink
Do you already have all the 190000 directories created? If not, could you find out which of the paths need it and do a stat directly instead of find?
Quite probably not all of them have been created (but counting how
many would take very long...). Hm, maybe running stat in a double loop
(thanks to our directory structure) would help. Something like this
(may not be 100% correct):

for a in {100..999}; do
  for b in {100..999}; do
    stat "$a/$b/"
  done
done

Should run stat on all directories. I think i'll give this a try.
Pranith Kumar Karampuri
2018-07-27 06:52:04 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Do you already have all the 190000 directories already created? If not
could you find out which of the paths need it and do a stat directly
instead of find?
Quite probable not all of them have been created (but counting how
much would take very long...). Hm, maybe running stat in a double loop
(thx to our directory structure) would help. Something like this (may
for a in ${100..999}; do
for b in ${100..999}; do
stat /$a/$b/
done
done
Should run stat on all directories. I think i'll give this a try.
Just to prevent these being served from a cache, it is probably better to
do this from a fresh mount?
--
Pranith
Hu Bert
2018-07-27 07:06:16 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Do you already have all the 190000 directories already created? If not
could you find out which of the paths need it and do a stat directly instead
of find?
Quite probable not all of them have been created (but counting how
much would take very long...). Hm, maybe running stat in a double loop
(thx to our directory structure) would help. Something like this (may
for a in ${100..999}; do
for b in ${100..999}; do
stat /$a/$b/
done
done
Should run stat on all directories. I think i'll give this a try.
Just to prevent these served from a cache, it is probably better to do this
from a fresh mount?
--
Pranith
Good idea. I'll install the glusterfs client on a little-used machine, so
there should be no caching. Thx! Have a good weekend when the time
comes :-)
Pranith Kumar Karampuri
2018-07-27 07:22:15 UTC
Permalink
If this proves effective, what you need to also do is unmount and mount
again, something like:

# /mnt/heal-crawl is just an example mount point for a fresh client mount
mount -t glusterfs gluster11:/shared /mnt/heal-crawl
cd /mnt/heal-crawl
for a in {100..999}; do
  for b in {100..999}; do
    stat "$a/$b/"    # prepend public/ or private/ as needed for the layout above
  done
done
cd / && umount /mnt/heal-crawl
--
Pranith
Hu Bert
2018-07-27 08:02:46 UTC
Permalink
I'll see what is possible over the weekend.

Btw.: I've seen in the munin stats that the disk utilization for
bricksdd1 on the healthy gluster servers is between 70% (night) and
almost 99% (daytime). So it looks like the basic problem is the
disk, which doesn't seem to be able to work any faster? If so, (heal)
performance won't improve with this setup, I assume. Maybe switching
to RAID10 (conventional hard disks), SSDs or even adding 3 additional
gluster servers (distributed replicated) could help?
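(As a quick check, iostat from the sysstat package shows the per-disk
utilization directly; the device name 'sdd' is only assumed here from the
brick name bricksdd1:)

# %util close to 100% means the disk itself is the bottleneck
iostat -dxm sdd 5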
Pranith Kumar Karampuri
2018-07-27 08:31:36 UTC
Permalink
Post by Hu Bert
I'll see what is possible over the weekend.
Btw.: i've seen in the munin stats that the disk utilization for
bricksdd1 on the healthy gluster servers is between 70% (night) and
almost 99% (daytime). So it looks like that the basic problem is the
disk which seems not to be able to work faster? If so (heal)
performance won't improve with this setup, i assume.
It could be saturating in the day. But if enough self-heals are going on,
even in the night
it should have been close to 100%.
Post by Hu Bert
Maybe switching
to RAID10 (conventional hard disks), SSDs or even add 3 additional
gluster servers (distributed replicated) could help?
It definitely will give better protection against hardware failure. The
failure domain will be smaller.
--
Pranith
Hu Bert
2018-07-27 08:47:44 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Hu Bert
Btw.: i've seen in the munin stats that the disk utilization for
bricksdd1 on the healthy gluster servers is between 70% (night) and
almost 99% (daytime). So it looks like that the basic problem is the
disk which seems not to be able to work faster? If so (heal)
performance won't improve with this setup, i assume.
It could be saturating in the day. But if enough self-heals are going on,
even in the night it should have been close to 100%.
Lowest utilization was 70% overnight, but I'll check this
evening/weekend. Also, that 'stat...' loop is running.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Maybe switching
to RAID10 (conventional hard disks), SSDs or even add 3 additional
gluster servers (distributed replicated) could help?
It definitely will give better protection against hardware failure. Failure
domain will be lesser.
What, in your opinion, would be better for performance?

- Having 3 servers and RAID10 (with conventional disks)
- Having 3 additional servers with 4 hdds (JBOD) each (distribute
replicate, replica 3)
- SSDs? (would be quite expensive to reach the storage amount we have
at the moment)

Just curious. It seems we'll have to adjust our setup during winter anyway :-)

Thanx again :-)
Hu Bert
2018-08-01 07:31:54 UTC
Permalink
Hello :-) Just wanted to give a short report...
Post by Hu Bert
Post by Pranith Kumar Karampuri
It could be saturating in the day. But if enough self-heals are going on,
even in the night it should have been close to 100%.
Lowest utilization was 70% overnight, but i'll check this
evening/weekend. Also that 'stat...' is running.
At the moment 1.1TB of 2.0TB got healed, disk utilization still
between 100% (day) and 70% (night). So this will take another 10-14
days.
Post by Hu Bert
What, in your opinion, would be better for performance?
- Having 3 servers and RAID10 (with conventional disks)
- Having 3 additional servers with 4 hdds (JBOD) each (distribute
replicate, replica 3)
- SSDs? (would be quite expensive to reach the storage amount we have
at the moment)
Just curious. It seems we'll have to adjust our setup during winter anyway :-)
Well, we'll definitely rethink our setup this autumn :-)
Hu Bert
2018-08-14 07:37:54 UTC
Permalink
Hi there,

well, it seems the heal has finally finished. Couldn't see/find any
related log message; is there such a message in a specific log file?

But i see the same behaviour as when the last heal finished: all CPU
cores are consumed by brick processes; not only by the formerly failed
bricksdd1, but by all 4 brick processes (and their threads). Load goes
up to > 100 on the 2 servers with the not-failed brick, and
glustershd.log gets filled with a lot of entries. Load on the server
with the then-failed brick is not that high, but still ~60.

Is this behaviour normal? Is there some post-heal after a heal has finished?

thx in advance :-)
Hu Bert
2018-08-15 09:07:37 UTC
Permalink
Hello again :-)

The self heal must have finished, as there are no log entries in the
glustershd.log files anymore. According to munin, disk latency (average
io wait) has gone down to 100 ms, and disk utilization has gone down
to ~60% - both on all servers and hard disks.

But now system load on 2 servers (which were in the good state)
fluctuates between 60 and 100; the server with the formerly failed
disk has a load of 20-30. I've uploaded some munin graphics of the cpu
usage:

https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png

This can't be normal. 2 of the servers under heavy load and one not
that much. Does anyone have an explanation of this strange behaviour?


Thx :-)
Hu Bert
2018-08-16 09:57:28 UTC
Permalink
Hi,

well, as the situation doesn't get better, we're quite helpless and
mostly in the dark, so we're thinking about hiring some professional
support. Any hint? :-)
Pranith Kumar Karampuri
2018-08-17 05:33:25 UTC
Permalink
Could you do the following on one of the nodes where you are observing high
CPU usage and attach that file to this thread? We can find what
threads/processes are leading to high usage. Do this for say 10 minutes
when you see the ~100% CPU.

top -bHd 5 > /tmp/top.${HOSTNAME}.txt
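If useful, the run can also be capped at roughly 10 minutes directly,
e.g. 120 iterations at the 5 second delay (the iteration count is just
an assumption):

top -bHd 5 -n 120 > /tmp/top.${HOSTNAME}.txt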
--
Pranith
Pranith Kumar Karampuri
2018-08-17 06:26:17 UTC
Permalink
As per the output, all io-threads are using a lot of CPU. It is better to
check the volume profile to see what is leading to so much work for the
io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section: "Running GlusterFS Volume Profile Command" and attach the output
of "gluster volume profile info".
Post by Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached the file.
Hopefully it helps.
--
Pranith
Pranith Kumar Karampuri
2018-08-17 06:31:08 UTC
Permalink
Please do volume profile also for around 10 minutes when CPU% is high.

--
Pranith
Hu Bert
2018-08-17 06:48:48 UTC
Permalink
i hope i did get it right.

gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop

If that's ok, i've attached the output of the info command.
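For reference, the same steps as a small one-shot script (just a sketch;
the 600 second sleep approximates the 10 minute wait and the output file
name is made up):

#!/bin/sh
# sketch: capture a ~10 minute profile window for volume "shared"
gluster volume profile shared start
sleep 600
gluster volume profile shared info > /tmp/profile.shared.$(date +%Y%m%d-%H%M).txt
gluster volume profile shared stop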
Pranith Kumar Karampuri
2018-08-17 07:30:26 UTC
Permalink
There seem to be too many lookup operations compared to any other
operation. What is the workload on the volume?
--
Pranith
Hu Bert
2018-08-17 08:19:03 UTC
Permalink
I don't know what exactly you mean by workload, but the main
function of the volume is storing (incl. writing and reading) images
(from hundreds of bytes up to 30 MB, ~7TB overall). The work is done
by apache tomcat servers writing to / reading from the volume. Besides
images there are some text files and binaries that are stored on the
volume and get updated regularly (every x hours); we'll try to migrate
the latter ones to local storage asap.

Interestingly it's only one process (and its threads) of the same
brick on 2 of the gluster servers that consumes the CPU.

gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU

Besides: performance during heal (e.g. gluster12, bricksdd1) was way
better than it is now. I've attached 2 pngs showing the differing cpu
usage of last week before/after heal.
Pranith Kumar Karampuri
2018-08-20 08:51:21 UTC
Permalink
There are a lot of Lookup operations in the system. But I am not able to
find why. Could you check the output of

# gluster volume heal <volname> info | grep -i number

it should print all zeros.
--
Pranith
Hu Bert
2018-08-20 09:10:48 UTC
Permalink
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0

Looks good to me.
Pranith Kumar Karampuri
2018-08-20 09:20:56 UTC
Permalink
Even the brick which doesn't have high CPU seems to have the same number of
lookups, so that's not it.
Is there any difference at all between the machines which have high CPU vs
low CPU?
I think the only other thing I would do is to install perf tools and try to
figure out the call-graph which is leading to so much CPU.

This affects performance of the brick I think, so you may have to do it
quickly and for a short time only.

perf record --call-graph=dwarf -p <brick-pid> -o </path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
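For example, a sketch of a concrete invocation (the pgrep pattern for
finding the bricksdd1 brick PID and the 60 second window are assumptions,
not something from this thread):

# sketch: find the bricksdd1 brick process and record it for ~60 seconds
BRICK_PID=$(pgrep -f 'glusterfsd.*bricksdd1' | head -n1)
perf record --call-graph=dwarf -p "$BRICK_PID" -o /tmp/perf.$(hostname).bricksdd1.out -- sleep 60
perf report -i /tmp/perf.$(hostname).bricksdd1.out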
--
Pranith
Hu Bert
2018-08-20 09:50:45 UTC
Permalink
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.

Ok, i ran perf for a few seconds.

------------------------
perf record --call-graph=dwarf -p 7897 -o /tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Warning:
Processed 83690 events and lost 96 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------

I copied a couple of lines:

+ 8.10%  0.00%  glusteriotwr22  [unknown]          [k] 0xffffffffffffffff
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] iterate_dir
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] sys_getdents
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] filldir
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] do_syscall_64
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_swapgs
+ 8.10%  0.00%  glusteriotwr22  libc-2.24.so       [.] 0xffff80c60db8ef2b
+ 8.10%  0.00%  glusteriotwr22  libc-2.24.so       [.] readdir64
+ 8.10%  0.00%  glusteriotwr22  index.so           [.] 0xffff80c6192a1888
+ 8.10%  0.04%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_htree_fill_tree
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_readdir
+ 7.95%  0.12%  glusteriotwr22  [kernel.kallsyms]  [k] htree_dirblock_to_tree
+ 5.78%  0.96%  glusteriotwr22  [kernel.kallsyms]  [k] __ext4_read_dirblock
+ 4.80%  0.02%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_bread
+ 4.78%  0.04%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_getblk
+ 4.72%  0.02%  glusteriotwr22  [kernel.kallsyms]  [k] __getblk_gfp
+ 4.57%  0.00%  glusteriotwr3   [unknown]          [k] 0xffffffffffffffff
+ 4.55%  0.00%  glusteriotwr3   [kernel.kallsyms]  [k] do_syscall_64

Do you need different or additional information?
Pranith Kumar Karampuri
2018-08-21 04:43:43 UTC
Permalink
This looks like there are a lot of readdirs going on, which is different from
what we observed earlier. How many seconds did you do perf record for? Will
it be possible for you to do this for some more time, maybe a minute? Just
want to be sure that the data actually represents what we are observing.
--
Pranith
Pranith Kumar Karampuri
2018-08-21 05:13:05 UTC
Permalink
I found one code path which on lookup does readdirs. Could you give me the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all three
bricks? It can probably give a correlation to see if it is indeed the same
issue or not.
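For example, something along these lines could collect that from all
three nodes in one go (a sketch; it assumes passwordless ssh between the
nodes, and the brick path below is only a placeholder that has to be set
to the real bricksdd1 brick directory):

# sketch: gather the xattrop index listing from every node
BRICK=/path/to/bricksdd1/brick    # placeholder
for host in gluster11 gluster12 gluster13; do
    ssh "$host" "ls -l $BRICK/.glusterfs/indices/xattrop" > /tmp/xattrop.$host.txt
done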
--
Pranith
Hu Bert
2018-08-21 06:10:51 UTC
Permalink
Good morning :-)

gluster11:
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd

gluster12:
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82

gluster13:
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
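As an aside, a quick way to compare the three indices is to count everything
except the xattrop-<gfid> base file; this is only a sketch and assumes the
usual index layout where additional entries correspond to pending heals:

for d in /gluster/bricksdd1*/shared/.glusterfs/indices/xattrop; do
    echo "$d: $(ls "$d" | grep -cv '^xattrop-') entries besides the base file"
done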


And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)

fyi - load at the moment:
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50

perf record --call-graph=dwarf -p 7897 -o /tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Warning:
Processed 2137218 events and lost 33446 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]

Here's an excerpt.

+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir

Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
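As an aside, such a capture can be kept smaller by bounding the sampling
frequency and the duration; the -F 99 rate and the 60 seconds below are only
suggested values, not something requested in this thread:

timeout 60 perf record -F 99 --call-graph=dwarf -p 7897 -o /tmp/perf.gluster11.bricksdd1.out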
Pranith Kumar Karampuri
2018-08-21 06:17:15 UTC
Permalink
Post by Hu Bert
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! Yes, a link would be great. I am not that good with the kernel side
of things, so I will have to show this information to someone else who knows
them better; expect some delay in the response.
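One possible alternative to moving the 17 GB raw file around, assuming perf is
available on that node, is to render a plain-text report and compress that;
the file names below are just examples:

perf report --stdio -i /tmp/perf.gluster11.bricksdd1.out > /tmp/perf.gluster11.bricksdd1.report.txt
gzip /tmp/perf.gluster11.bricksdd1.report.txt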
--
Pranith
Hu Bert
2018-08-22 06:31:01 UTC
Permalink
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
the node with the lowest load i see in cli.log.1:

[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now

every 3 seconds. Looks like this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and the network is fine.

In cli.log there are only these entries:

[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0

Just wondered if this could be related anyhow.
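Two quick checks that might narrow this down; the log paths are the ones from
this thread, the rest is only a sketch. If the CLI really gets started every
few seconds, something on that node (a monitoring script, for example) is
probably invoking 'gluster' periodically:

# how often has the gluster CLI been started?
grep -c 'Started running gluster' /var/log/glusterfs/cli.log

# EPOLLERR messages per minute in the rotated cli log
grep EPOLLERR /var/log/glusterfs/cli.log.1 | awk '{print $1, substr($2, 1, 5)}' | uniq -c | tail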
Pranith Kumar Karampuri
2018-08-23 11:58:12 UTC
Permalink
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
the node with the lowest load i see in cli.log.1:
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
every 3 seconds. Looks like this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and the network is fine.
In cli.log there are only these entries:
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could be related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
filldir
Post by Pranith Kumar Karampuri
Post by Hu Bert
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side
of
Post by Pranith Kumar Karampuri
things. So I will have to show this information to someone else who knows
these things so expect delay in response.
Post by Hu Bert
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.] readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k] ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k] ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is different from
what we observed earlier, how many seconds did you do perf record
for?
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Will
it be possible for you to do this for some more time? may be a
minute?
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Just
want to be sure that the data actually represents what we are observing.
I found one code path which on lookup does readdirs. Could you give me the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the
three
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
bricks? It can probably give a correlation to see if it is indeed the same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same
number
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
of
lookups, so that's not it.
Is there any difference at all between the machines which have
high
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools
and
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have to
do
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am
not
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert <
Post by Hu Bert
I don't know what you exactly mean with workload, but the main
function of the volume is storing (incl. writing, reading) images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The work
is
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
done
by apache tomcat servers writing to / reading from the volume.
Besides
images there are some text files and binaries that are stored
on
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
the
volume and get updated regularly (every x hours); we'll try to
migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1)
was
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
way
better than it is now. I've attached 2 pngs showing the differing
cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seems to be too many lookup operations compared to any
other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info command.
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when
CPU%
is
high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of
CPU.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
It
is
better
to
check what the volume profile is to see what is leading
to
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
so
much
work
for
io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
section: "
Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached
the
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where
you
are
observing
high
CPU usage and attach that file to this thread? We can
find
what
threads/processes are leading to high usage. Do this
for
say
10
minutes
when
you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no log
entries
in
glustershd.log files anymore. According to munin
disk
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
latency
(average
io wait) has gone down to 100 ms, and disk
utilization
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
has
gone
down
to ~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in the
good
state)
fluctuates between 60 and 100; the server with the
formerly
failed
disk has a load of 20-30.I've uploaded some munin
graphics of
the
cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy
load
and
one
not
that much. Does anyone have an explanation of this
strange
behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished.
Couldn't
see/find
any
related log message; is there such a message in a
specific
log
file?
But i see the same behaviour when the last heal
all
CPU
cores are consumed by brick processes; not only by
the
formerly
failed
bricksdd1, but by all 4 brick processes (and their
threads).
Load
goes
up to > 100 on the 2 servers with the not-failed
brick,
and
glustershd.log gets filled with a lot of entries.
Load
on
the
server
with the then failed brick not that high, but
still
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Hu Bert
~60.
Is this behaviour normal? Is there some post-heal
after
a
heal
has
finished?
thx in advance :-)
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
Milind Changire
2018-08-27 13:41:34 UTC
Permalink
On Thu, Aug 23, 2018 at 5:28 PM, Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't all 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that shoud
have been fixed in the 3.12.x release, and network is fine.
That's odd. Presuming cli.log.1 is the logrotated file, it should be
showing older log entries than cli.log. But that's not the case here.
Or maybe there's something running on the command line on the node with
the lowest load.
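A quick way to check that (a minimal sketch, assuming the default /var/log/glusterfs location) is to compare the first entries of both files:
ls -l /var/log/glusterfs/cli.log /var/log/glusterfs/cli.log.1
head -n 1 /var/log/glusterfs/cli.log /var/log/glusterfs/cli.log.1
If rotation worked as expected, cli.log.1 should start with older timestamps than cli.log.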
Post by Pranith Kumar Karampuri
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
filldir
Post by Pranith Kumar Karampuri
Post by Hu Bert
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side
of
Post by Pranith Kumar Karampuri
things. So I will have to show this information to someone else who
knows
Post by Pranith Kumar Karampuri
these things so expect delay in response.
Post by Hu Bert
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650
v3
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.] readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is
different
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
from
what we observed earlier, how many seconds did you do perf record
for?
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Will
it be possible for you to do this for some more time? may be a
minute?
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Just
want to be sure that the data actually represents what we are observing.
I found one code path which on lookup does readdirs. Could you give
me
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the
three
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
bricks? It can probably give a correlation to see if it is indeed the same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same
number
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
of
lookups, so that's not it.
Is there any difference at all between the machines which have
high
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools
and
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have
to do
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am
not
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert <
Post by Hu Bert
I don't know what you exactly mean with workload, but the
main
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
function of the volume is storing (incl. writing, reading)
images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The
work is
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
done
by apache tomcat servers writing to / reading from the
volume.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Besides
images there are some text files and binaries that are
stored on
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
the
volume and get updated regularly (every x hours); we'll try
to
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the
same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1)
was
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
way
better than it is now. I've attached 2 pngs showing the
differing
cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seems to be too many lookup operations compared to
any
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info
command.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when
CPU%
is
high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of
CPU.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
It
is
better
to
check what the volume profile is to see what is
leading to
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
so
much
work
for
io-threads. Please follow the documentation at
https://gluster.readthedocs.
io/en/latest/Administrator%20Guide/Monitoring%20Workload/
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
section: "
Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached
the
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where
you
are
observing
high
CPU usage and attach that file to this thread? We
can
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
find
what
threads/processes are leading to high usage. Do this
for
say
10
minutes
when
you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no
log
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
entries
in
glustershd.log files anymore. According to munin
disk
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
latency
(average
io wait) has gone down to 100 ms, and disk
utilization
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
has
gone
down
to ~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in the
good
state)
fluctuates between 60 and 100; the server with the
formerly
failed
disk has a load of 20-30.I've uploaded some munin
graphics of
the
cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy
load
and
one
not
that much. Does anyone have an explanation of this
strange
behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished.
Couldn't
see/find
any
related log message; is there such a message in a
specific
log
file?
But i see the same behaviour when the last heal
all
CPU
cores are consumed by brick processes; not only
by
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Hu Bert
the
formerly
failed
bricksdd1, but by all 4 brick processes (and
their
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Hu Bert
threads).
Load
goes
up to > 100 on the 2 servers with the not-failed
brick,
and
glustershd.log gets filled with a lot of entries.
Load
on
the
server
with the then failed brick not that high, but
still
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Hu Bert
~60.
Is this behaviour normal? Is there some post-heal
after
a
heal
has
finished?
thx in advance :-)
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Milind
Hu Bert
2018-08-27 14:14:58 UTC
Permalink
yeah, on debian xyz.log.1 is always the former logfile, which has been
rotated by logrotate. Just checked the 3 servers: now it looks good; i
will check it again tomorrow. Very strange, maybe logrotate hasn't
worked properly.
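If it shows up again, a dry run of the shipped logrotate config would at least show what logrotate thinks it should do, without touching any files (a sketch - the exact config filename under /etc/logrotate.d/ may differ on debian):
cat /etc/logrotate.d/glusterfs*
logrotate -d /etc/logrotate.d/glusterfs*
The -d switch only simulates the rotation and prints its decisions.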

the performance problems remain :-)
Post by Milind Changire
On Thu, Aug 23, 2018 at 5:28 PM, Pranith Kumar Karampuri
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't all 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that shoud
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
That's odd. Presuming cli.log.1 is the logrotated file, it should be showing
older log entries than cli.log. But its not the case here.
Or maybe, there's something running on the command-line on the node with the
lowest load.
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side of
things. So I will have to show this information to someone else who knows
these things so expect delay in response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is different
from
what we observed earlier, how many seconds did you do perf record for?
Will
it be possible for you to do this for some more time? may be a minute?
Just
want to be sure that the data actually represents what we are observing.
I found one code path which on lookup does readdirs. Could you give me
the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the three
bricks? It can probably give a correlation to see if it is indeed the
same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same number
of
lookups, so that's not it.
Is there any difference at all between the machines which have high
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools and
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have to do
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am
not
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert
Post by Hu Bert
I don't know what you exactly mean with workload, but the
main
function of the volume is storing (incl. writing, reading)
images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The
work is
done
by apache tomcat servers writing to / reading from the
volume.
Besides
images there are some text files and binaries that are
stored on
the
volume and get updated regularly (every x hours); we'll try
to
migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the
same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1)
was
way
better than it is now. I've attached 2 pngs showing the
differing
cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seems to be too many lookup operations compared to
any
other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info
command.
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes
when
CPU%
is
high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar
Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of
CPU.
It
is
better
to
check what the volume profile is to see what is
leading to
so
much
work
for
io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section: "
Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached
the
file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes
where
you
are
observing
high
CPU usage and attach that file to this thread? We
can
find
what
threads/processes are leading to high usage. Do
this
for
say
10
minutes
when
you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no
log
entries
in
glustershd.log files anymore. According to munin
disk
latency
(average
io wait) has gone down to 100 ms, and disk
utilization
has
gone
down
to ~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in
the
good
state)
fluctuates between 60 and 100; the server with the
formerly
failed
disk has a load of 20-30.I've uploaded some munin
graphics of
the
cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy
load
and
one
not
that much. Does anyone have an explanation of this
strange
behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished.
Couldn't
see/find
any
related log message; is there such a message in
a
specific
log
file?
But i see the same behaviour when the last heal
all
CPU
cores are consumed by brick processes; not only
by
the
formerly
failed
bricksdd1, but by all 4 brick processes (and
their
threads).
Load
goes
up to > 100 on the 2 servers with the not-failed
brick,
and
glustershd.log gets filled with a lot of
entries.
Load
on
the
server
with the then failed brick not that high, but
still
~60.
Is this behaviour normal? Is there some
post-heal
after
a
heal
has
finished?
thx in advance :-)
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Milind
Hu Bert
2018-08-28 05:04:24 UTC
Permalink
Good Morning,

today i updated + rebooted all gluster servers (kernel update to
4.9.0-8 and gluster to 3.12.13). Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks came up at the
beginning but then lost its connection.

OK:
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0 Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0 Y 2136

Lost connection:

Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0 Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A N N/A

gluster volume heal shared info:
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
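(A quick check at that point - the match pattern is only a sketch - would be
pgrep -af 'bricksdd1_new'
on gluster13; it shows whether the brick's glusterfsd process is still around at all, or whether it has died rather than merely lost its connection.)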

reboot was at 06:15:39; the brick then worked for a short period, but
then somehow disconnected.

from gluster13:/var/log/glusterfs/glusterd.log:
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
[glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management:
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157

After 'gluster volume start shared force' (then with new port 49157):

Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0 Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0 Y 3994

from /var/log/syslog:

Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: pending frames:
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: frame :
type(0) op(0)
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: frame :
type(0) op(0)
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]:
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal
received: 11
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: time of crash:
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]:
2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]:
configuration details:
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]:
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------

There are some errors and warnings in shared.log (the volume logfile),
but no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
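Since syslog logged "signal received: 11" the brick process segfaulted, and glusterfs normally writes the corresponding backtrace into the brick's own logfile right after those lines. Something like this should dig it out (assuming the usual log naming under /var/log/glusterfs/bricks/ - the filename is a guess based on the brick path):
grep -A40 'signal received: 11' /var/log/glusterfs/bricks/gluster-bricksdd1_new-shared.log
That backtrace would be the interesting part for a bug report.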

Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users cause more traffic -> more work on
the servers); 'gluster volume heal shared info' shows no entries.
status:
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared 49152 0 Y 2482
Brick gluster12:/gluster/bricksda1/shared 49152 0 Y 2088
Brick gluster13:/gluster/bricksda1/shared 49152 0 Y 2115
Brick gluster11:/gluster/bricksdb1/shared 49153 0 Y 2489
Brick gluster12:/gluster/bricksdb1/shared 49153 0 Y 2094
Brick gluster13:/gluster/bricksdb1/shared 49153 0 Y 2116
Brick gluster11:/gluster/bricksdc1/shared 49154 0 Y 2497
Brick gluster12:/gluster/bricksdc1/shared 49154 0 Y 2095
Brick gluster13:/gluster/bricksdc1/shared 49154 0 Y 2127
Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0 Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0 Y 3994
Self-heal Daemon on localhost N/A N/A Y 4868
Self-heal Daemon on gluster12 N/A N/A Y 3813
Self-heal Daemon on gluster11 N/A N/A Y 5762

Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks

Very strange. Thanks for reading if you've reached this line :-)
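If the brick segfaults again, a core dump would give the developers something concrete to look at. Roughly (assuming core dumps are enabled and debug symbols, e.g. a glusterfs-dbg package, are installed; the core path is a placeholder):
gdb /usr/sbin/glusterfsd /path/to/core
and then "bt" at the (gdb) prompt prints the backtrace, which could be attached to a bug report.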
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't all 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that shoud
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side of
things. So I will have to show this information to someone else who knows
these things so expect delay in response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.] readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is different
from
what we observed earlier, how many seconds did you do perf record for?
Will
it be possible for you to do this for some more time? may be a minute?
Just
want to be sure that the data actually represents what we are observing.
I found one code path which on lookup does readdirs. Could you give me
the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the three
bricks? It can probably give a correlation to see if it is indeed the same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same number
of
lookups, so that's not it.
Is there any difference at all between the machines which have high
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools and
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have to do
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am not
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert
Post by Hu Bert
I don't know what you exactly mean with workload, but the main
function of the volume is storing (incl. writing, reading) images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The work is done
by apache tomcat servers writing to / reading from the volume. Besides
images there are some text files and binaries that are stored on the
volume and get updated regularly (every x hours); we'll try to migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1) was way
better than it is now. I've attached 2 pngs showing the differing cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seem to be too many lookup operations compared to any other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info
command.
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when CPU% is high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of CPU. It is better
to check what the volume profile is to see what is leading to so much
work for io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section: "Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached the file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where you are observing
high CPU usage and attach that file to this thread? We can find what
threads/processes are leading to high usage. Do this for say 10
minutes when you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no log entries in
glustershd.log files anymore. According to munin disk latency (average
io wait) has gone down to 100 ms, and disk utilization has gone down
to ~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in the good state)
fluctuates between 60 and 100; the server with the formerly failed
disk has a load of 20-30. I've uploaded some munin graphics of the cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy load and one not
that much. Does anyone have an explanation of this strange behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished. Couldn't see/find any
related log message; is there such a message in a specific log file?
But i see the same behaviour as when the last heal finished: all CPU
cores are consumed by brick processes; not only by the formerly failed
bricksdd1, but by all 4 brick processes (and their threads). Load goes
up to > 100 on the 2 servers with the not-failed brick, and
glustershd.log gets filled with a lot of entries. Load on the server
with the then failed brick is not that high, but still ~60.
Is this behaviour normal? Is there some post-heal after a heal has
finished?
thx in advance :-)
--
Pranith
Hu Bert
2018-08-28 06:54:11 UTC
Permalink
a little update after about 2 hours of uptime: still/again high cpu
usage by one brick process. server load >30.

gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd

The process for brick bricksdd1 consumes almost all 12 cores.
Interestingly there are more threads for the bricksdd1 process than
for the other bricks. Counted with "ps huH p <PID_OF_U_PROCESS> | wc
-l"

gluster11:
bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads,
bricksdd1 85 threads
gluster12:
bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 58 threads
gluster13:
bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 82 threads

Don't know if that could be relevant.
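(A minimal sketch of how these per-brick thread counts could be collected in
one loop; it assumes the brick path appears on the glusterfsd command line,
which is what pgrep -f matches against, and uses the brick names from this
setup:

  for b in bricksda1 bricksdb1 bricksdc1 bricksdd1 bricksdd1_new; do
      pid=$(pgrep -f "/gluster/$b/shared" | head -n1)
      [ -n "$pid" ] && echo "$b: $(ps huH p "$pid" | wc -l) threads"
  done

ps huH prints one line per thread and no header, so wc -l yields the thread
count directly.)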
Post by Hu Bert
Good Morning,
today i updated + rebooted all gluster servers, kernel update to
4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks did come up at the
beginning but then lost connection.
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0
Y 2136
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A
N N/A
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
reboot was at 06:15:39; brick then worked for a short period, but then
somehow disconnected.
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
type(0) op(0)
type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal
received: 11
2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------
There are some errors+warnings in the shared.log (volume logfile), but
no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
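(When a brick process dies with signal 11 like this, the full backtrace is
usually written to the brick's own log rather than the volume log. A minimal
sketch of where to look, assuming the default log location and the brick path
used in this setup:

  # brick logs are named after the brick path, with '/' replaced by '-'
  grep -A 30 "signal received: 11" \
      /var/log/glusterfs/bricks/gluster-bricksdd1_new-shared.log

The stack frames printed after the "signal received" line are what would be
needed to match the crash against a known bug.)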
Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users will cause more traffic -> more
work on servers), 'gluster volume heal shared info' shows no entries,
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared 49152 0 Y 2482
Brick gluster12:/gluster/bricksda1/shared 49152 0 Y 2088
Brick gluster13:/gluster/bricksda1/shared 49152 0 Y 2115
Brick gluster11:/gluster/bricksdb1/shared 49153 0 Y 2489
Brick gluster12:/gluster/bricksdb1/shared 49153 0 Y 2094
Brick gluster13:/gluster/bricksdb1/shared 49153 0 Y 2116
Brick gluster11:/gluster/bricksdc1/shared 49154 0 Y 2497
Brick gluster12:/gluster/bricksdc1/shared 49154 0 Y 2095
Brick gluster13:/gluster/bricksdc1/shared 49154 0 Y 2127
Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
Self-heal Daemon on localhost N/A N/A Y 4868
Self-heal Daemon on gluster12 N/A N/A Y 3813
Self-heal Daemon on gluster11 N/A N/A Y 5762
Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks
Very strange. Thanks for reading if you've reached this line :-)
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could be related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! Yes, a link would be great. I am not as good with the kernel
side of things, so I will have to show this information to someone else
who knows these things; expect a delay in the response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
[...]
Hu Bert
2018-08-28 07:24:37 UTC
Permalink
Hm, i noticed that in the shared.log (volume log file) on gluster11
and gluster12 (but not on gluster13) i now see these warnings:

[2018-08-28 07:18:57.224367] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3054593291
[2018-08-28 07:19:17.733625] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2595205890
[2018-08-28 07:19:27.950355] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3105728076
[2018-08-28 07:19:42.519010] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3740415196
[2018-08-28 07:19:48.194774] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2922795043
[2018-08-28 07:19:52.506135] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2841655539
[2018-08-28 07:19:55.466352] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3049465001

Don't know if that could be related.
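(These dht_layout_search warnings usually mean that a directory's hash layout
does not cover the full range, e.g. because the layout xattr is missing on one
subvolume. A minimal sketch of how that could be checked, assuming the attr
tools are installed; <some-directory> is a placeholder for a directory that
triggers the warning:

  # compare the layout xattr of the same directory on each brick
  getfattr -n trusted.glusterfs.dht -e hex \
      /gluster/bricksdd1/shared/<some-directory>

If the trusted.glusterfs.dht xattr is missing on some brick, a
'gluster volume rebalance shared fix-layout start' is the usual way to have
DHT recreate the layout; treat this as a hint to verify, not a confirmed
diagnosis.)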
Post by Hu Bert
[...]
Pranith
Hu Bert
2018-08-31 07:48:31 UTC
Permalink
Hi Pranith,

i just wanted to ask if you were able to get any feedback from your
colleagues :-)

btw.: we migrated some stuff (static resources, small files) to an nfs
server that we actually wanted to replace with glusterfs. Load and cpu
usage have gone down a bit, but are still asymmetric across the 3
gluster servers.
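(A minimal sketch of one way to compare the per-brick CPU share across the
three servers, assuming passwordless ssh and that the brick processes are the
glusterfsd instances; the hostnames are the ones from this setup:

  for h in gluster11 gluster12 gluster13; do
      echo "### $h"
      ssh "$h" 'ps -C glusterfsd -o pid,pcpu,nlwp,args --sort=-pcpu'
  done

pcpu is the accumulated CPU percentage and nlwp the thread count, so the
asymmetry between the servers should show up directly in this output.)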
Post by Hu Bert
Hm, i noticed that in the shared.log (volume log file) on gluster11
[2018-08-28 07:18:57.224367] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3054593291
[2018-08-28 07:19:17.733625] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2595205890
[2018-08-28 07:19:27.950355] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3105728076
[2018-08-28 07:19:42.519010] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3740415196
[2018-08-28 07:19:48.194774] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2922795043
[2018-08-28 07:19:52.506135] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2841655539
[2018-08-28 07:19:55.466352] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3049465001
Don't know if that could be related.
Post by Hu Bert
a little update after about 2 hours of uptime: still/again high cpu
usage by one brick processes. server load >30.
gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
The process for brick bricksdd1 consumes almost all 12 cores.
Interestingly there are more threads for the bricksdd1 process than
for the other bricks. Counted with "ps huH p <PID_OF_U_PROCESS> | wc
-l"
bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads,
bricksdd1 85 threads
bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 58 threads
bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 82 threads
Don't know if that could be relevant.
Post by Hu Bert
Good Morning,
today i update + rebooted all gluster servers, kernel update to
4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks did come up at the
beginning but then lost connection.
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0
Y 2136
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A
N N/A
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
reboot was at 06:15:39; brick then worked for a short period, but then
somehow disconnected.
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
type(0) op(0)
type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal
received: 11
2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------
There are some errors+warnings in the shared.log (volume logfile), but
no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users will cause more traffic -> more
work on servers), 'gluster volume heal shared info' shows no entries,
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared 49152 0 Y 2482
Brick gluster12:/gluster/bricksda1/shared 49152 0 Y 2088
Brick gluster13:/gluster/bricksda1/shared 49152 0 Y 2115
Brick gluster11:/gluster/bricksdb1/shared 49153 0 Y 2489
Brick gluster12:/gluster/bricksdb1/shared 49153 0 Y 2094
Brick gluster13:/gluster/bricksdb1/shared 49153 0 Y 2116
Brick gluster11:/gluster/bricksdc1/shared 49154 0 Y 2497
Brick gluster12:/gluster/bricksdc1/shared 49154 0 Y 2095
Brick gluster13:/gluster/bricksdc1/shared 49154 0 Y 2127
Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
Self-heal Daemon on localhost N/A N/A Y 4868
Self-heal Daemon on gluster12 N/A N/A Y 3813
Self-heal Daemon on gluster11 N/A N/A Y 5762
Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks
Very strange. Thanks for reading if you've reached this line :-)
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't all 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that shoud
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k]
do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k]
do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k]
do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side of
things. So I will have to show this information to someone else who knows
these things so expect delay in response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is different
from
what we observed earlier, how many seconds did you do perf record for?
Will
it be possible for you to do this for some more time? may be a minute?
Just
want to be sure that the data actually represents what we are
observing.
I found one code path which on lookup does readdirs. Could you give me
the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the three
bricks? It can probably give a correlation to see if it is indeed the
same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same
number
of
lookups, so that's not it.
Is there any difference at all between the machines which have
high
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools
and
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have to
do
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am
not
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert
Post by Hu Bert
I don't know what you exactly mean with workload, but the
main
function of the volume is storing (incl. writing, reading)
images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The work
is
done
by apache tomcat servers writing to / reading from the
volume.
Besides
images there are some text files and binaries that are stored
on
the
volume and get updated regularly (every x hours); we'll try
to
migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the
same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1)
was
way
better than it is now. I've attached 2 pngs showing the
differing
cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seems to be too many lookup operations compared to
any
other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info
command.
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when CPU% is high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of CPU. It is better
to check what the volume profile is to see what is leading to so much
work for io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section: "Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached the file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where you are observing
high CPU usage and attach that file to this thread? We can find what
threads/processes are leading to high usage. Do this for say 10 minutes
when you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no log entries in
glustershd.log files anymore. According to munin disk latency (average
io wait) has gone down to 100 ms, and disk utilization has gone down to
~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in the good state)
fluctuates between 60 and 100; the server with the formerly failed disk
has a load of 20-30. I've uploaded some munin graphics of the cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy load and one not
that much. Does anyone have an explanation of this strange behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished. Couldn't see/find any
related log message; is there such a message in a specific log file?
But i see the same behaviour as when the last heal finished: all CPU
cores are consumed by brick processes; not only by the formerly failed
bricksdd1, but by all 4 brick processes (and their threads). Load goes
up to > 100 on the 2 servers with the not-failed brick, and
glustershd.log gets filled with a lot of entries. Load on the server
with the then failed brick is not that high, but still ~60.
Is this behaviour normal? Is there some post-heal after a heal has
finished?
thx in advance :-)
--
Pranith
Pranith Kumar Karampuri
2018-09-03 05:55:07 UTC
Permalink
Post by Hu Bert
Hi Pranith,
i just wanted to ask if you were able to get any feedback from your
colleagues :-)
Sorry, I didn't get a chance to. I am working on a customer issue which is
taking away cycles from any other work. Let me get back to you once I get
time this week.
Post by Hu Bert
btw.: we migrated some stuff (static resources, small files) to a nfs
server that we actually wanted to replace by glusterfs. Load and cpu
usage has gone down a bit, but still is asymmetric on the 3 gluster
servers.
Post by Hu Bert
Hm, i noticed that in the shared.log (volume log file) on gluster11
[2018-08-28 07:18:57.224367] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3054593291
[2018-08-28 07:19:17.733625] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2595205890
[2018-08-28 07:19:27.950355] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3105728076
[2018-08-28 07:19:42.519010] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3740415196
[2018-08-28 07:19:48.194774] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2922795043
[2018-08-28 07:19:52.506135] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2841655539
[2018-08-28 07:19:55.466352] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3049465001
Don't know if that could be related.
Post by Hu Bert
a little update after about 2 hours of uptime: still/again high cpu
usage by one brick process. server load >30.
gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
The process for brick bricksdd1 consumes almost all 12 cores.
Interestingly there are more threads for the bricksdd1 process than
for the other bricks. Counted with "ps huH p <PID_OF_U_PROCESS> | wc -l":
bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads,
bricksdd1 85 threads
bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 58 threads
bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 82 threads
Don't know if that could be relevant.
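(A small helper like the following could collect those numbers in one go
on each server - untested sketch, and it assumes the brick daemons are
glusterfsd processes that carry the brick path on their command line:)

# count threads per brick process on this server
for pid in $(pgrep -f glusterfsd); do
    brick=$(tr '\0' ' ' < /proc/$pid/cmdline | grep -o '/gluster/[^ ]*' | head -1)
    echo "$brick: $(ps huH p $pid | wc -l) threads"
done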
Post by Hu Bert
Good Morning,
today i updated + rebooted all gluster servers, kernel update to
4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks did come up at the
beginning but then lost connection.
Status of volume: shared
Gluster process TCP Port RDMA Port
Online Pid
------------------------------------------------------------------------------
Post by Hu Bert
Post by Hu Bert
Post by Hu Bert
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0
Y 2136
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A
N N/A
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
reboot was at 06:15:39; brick then worked for a short period, but then
somehow disconnected.
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: pending
type(0) op(0)
type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal received: 11
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: time of 2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------
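(Signal 11 means the brick process crashed with a segfault; the full
backtrace should be in the brick's own log file. A quick way to pull it
out - assuming the usual log location for this brick, so treat the path
as a guess:)

grep -A 40 'signal received: 11' \
    /var/log/glusterfs/bricks/gluster-bricksdd1_new-shared.log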
There are some errors+warnings in the shared.log (volume logfile), but
no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users will cause more traffic -> more
work on servers), 'gluster volume heal shared info' shows no entries,
Status of volume: shared
Gluster process                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared     49152     0          Y       2482
Brick gluster12:/gluster/bricksda1/shared     49152     0          Y       2088
Brick gluster13:/gluster/bricksda1/shared     49152     0          Y       2115
Brick gluster11:/gluster/bricksdb1/shared     49153     0          Y       2489
Brick gluster12:/gluster/bricksdb1/shared     49153     0          Y       2094
Brick gluster13:/gluster/bricksdb1/shared     49153     0          Y       2116
Brick gluster11:/gluster/bricksdc1/shared     49154     0          Y       2497
Brick gluster12:/gluster/bricksdc1/shared     49154     0          Y       2095
Brick gluster13:/gluster/bricksdc1/shared     49154     0          Y       2127
Brick gluster11:/gluster/bricksdd1/shared     49155     0          Y       2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155     0          Y       2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157     0          Y       3994
Self-heal Daemon on localhost                 N/A       N/A        Y       4868
Self-heal Daemon on gluster12                 N/A       N/A        Y       3813
Self-heal Daemon on gluster11                 N/A       N/A        Y       5762
Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks
Very strange. Thanks for reading if you've reached this line :-)
2018-08-23 13:58 GMT+02:00 Pranith Kumar Karampuri <
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Post by Hu Bert
Post by Hu Bert
Post by Hu Bert
Post by Hu Bert
Just wondered if this could be related anyhow.
2018-08-21 8:17 GMT+02:00 Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k]
do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k]
do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k]
do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side
of things. So I will have to show this information to someone else who
knows these things so expect delay in response.
[...]
--
Pranith
Hu Bert
2018-09-19 12:04:53 UTC
Permalink
Hi Pranith,

i recently upgraded to version 3.12.14, still no change in
load/performance. Have you received any feedback?

At the moment i have 3 options:
- problem can be fixed within version 3.12
- upgrade to 4.1 and magically/hopefully "fix" the problem (might not
help when the problem is within the brick)
- replace glusterfs with $whatever (defeat... :-( )

thx
Hubert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Hi Pranith,
i just wanted to ask if you were able to get any feedback from your
colleagues :-)
Sorry, I didn't get a chance to. I am working on a customer issue which is
taking away cycles from any other work. Let me get back to you once I get
time this week.
Post by Hu Bert
btw.: we migrated some stuff (static resources, small files) to a nfs
server that we actually wanted to replace by glusterfs. Load and cpu
usage has gone down a bit, but still is asymmetric on the 3 gluster
servers.
Post by Hu Bert
Hm, i noticed that in the shared.log (volume log file) on gluster11
[2018-08-28 07:18:57.224367] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3054593291
[2018-08-28 07:19:17.733625] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2595205890
[2018-08-28 07:19:27.950355] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3105728076
[2018-08-28 07:19:42.519010] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3740415196
[2018-08-28 07:19:48.194774] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2922795043
[2018-08-28 07:19:52.506135] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2841655539
[2018-08-28 07:19:55.466352] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3049465001
Don't know if that could be related.
Post by Hu Bert
a little update after about 2 hours of uptime: still/again high cpu
usage by one brick process. server load >30.
gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
The process for brick bricksdd1 consumes almost all 12 cores.
Interestingly there are more threads for the bricksdd1 process than
for the other bricks. Counted with "ps huH p <PID_OF_U_PROCESS> | wc
-l"
bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads,
bricksdd1 85 threads
bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 58 threads
bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 82 threads
Don't know if that could be relevant.
Post by Hu Bert
Good Morning,
today i updated + rebooted all gluster servers, kernel update to
4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks did come up at the
beginning but then lost connection.
Status of volume: shared
Gluster process TCP Port RDMA Port
Online Pid
------------------------------------------------------------------------------
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0
Y 2136
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A
N N/A
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
reboot was at 06:15:39; brick then worked for a short period, but then
somehow disconnected.
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
type(0) op(0)
type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal
received: 11
2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------
There are some errors+warnings in the shared.log (volume logfile), but
no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users will cause more traffic -> more
work on servers), 'gluster volume heal shared info' shows no entries,
Status of volume: shared
Gluster process                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared     49152     0          Y       2482
Brick gluster12:/gluster/bricksda1/shared     49152     0          Y       2088
Brick gluster13:/gluster/bricksda1/shared     49152     0          Y       2115
Brick gluster11:/gluster/bricksdb1/shared     49153     0          Y       2489
Brick gluster12:/gluster/bricksdb1/shared     49153     0          Y       2094
Brick gluster13:/gluster/bricksdb1/shared     49153     0          Y       2116
Brick gluster11:/gluster/bricksdc1/shared     49154     0          Y       2497
Brick gluster12:/gluster/bricksdc1/shared     49154     0          Y       2095
Brick gluster13:/gluster/bricksdc1/shared     49154     0          Y       2127
Brick gluster11:/gluster/bricksdd1/shared     49155     0          Y       2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155     0          Y       2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157     0          Y       3994
Self-heal Daemon on localhost                 N/A       N/A        Y       4868
Self-heal Daemon on gluster12                 N/A       N/A        Y       3813
Self-heal Daemon on gluster11                 N/A       N/A        Y       5762
Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks
Very strange. Thanks for reading if you've reached this line :-)
2018-08-23 13:58 GMT+02:00 Pranith Kumar Karampuri
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could be related anyhow.
2018-08-21 8:17 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k]
do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k]
do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k]
do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_readdir
Or do you want to download the file
/tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side
of things. So I will have to show this information to someone else who
knows these things so expect delay in response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
On Mon, Aug 20, 2018 at 3:20 PM Hu Bert
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon
E5-1650
v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12
GBit/s
RAID Controller; operating system running on a raid1, then 4
disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown]
[k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so
[.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so
[.]
readdir64
+ 8.10% 0.00% glusteriotwr22 index.so
[.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms]
[k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms]
[k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms]
[k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms]
[k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms]
[k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms]
[k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown]
[k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms]
[k]
do_syscall_64
Do you need different or additional information?
This looks like there are a lot of readdirs going on, which is different
from what we observed earlier. For how many seconds did you run perf
record? Would it be possible for you to do this for some more time,
maybe a minute? I just want to be sure that the data actually represents
what we are observing.
I found one code path which does readdirs on lookup. Could you give me
the output of ls -l <brick-path>/.glusterfs/indices/xattrop on all three
bricks? It can probably give a correlation to see whether it is indeed
the same issue or not.
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have the same number
of lookups, so that's not it.
Is there any difference at all between the machines which have high CPU
vs low CPU?
I think the only other thing I would do is to install perf tools and try
to figure out the call-graph which is leading to so much CPU.
This affects performance of the brick I think, so you may have to do it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o </path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
On Mon, Aug 20, 2018 at 2:40 PM Hu Bert
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am not able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert
Post by Hu Bert
I don't know what you exactly mean with workload, but the main function
of the volume is storing (incl. writing, reading) images (from hundreds
of bytes up to 30 MBs, overall ~7TB). The work is done by apache tomcat
servers writing to / reading from the volume. Besides images there are
some text files and binaries that are stored on the volume and get
updated regularly (every x hours); we'll try to migrate the latter ones
to local storage asap.
Interestingly it's only one process (and its threads) of the same brick
on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1) was way
better than it is now. I've attached 2 pngs showing the differing cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seem to be too many lookup operations compared to any other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info command.
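As a rough sketch, that same sequence could be scripted so the info
output lands in a file that can be attached (the file name and the
10-minute sleep are assumptions):

# profile the volume for ~10 minutes while CPU is high, save the result
gluster volume profile shared start
sleep 600
gluster volume profile shared info > /tmp/profile.shared.$(date +%F_%H%M).txt
gluster volume profile shared stop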
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when CPU% is high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of CPU. It is better
to check the volume profile to see what is leading to so much work for
the io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section "Running GlusterFS Volume Profile Command"
and attach the output of "gluster volume profile info".
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached the file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where you are observing
high CPU usage and attach that file to this thread? We can find what
threads/processes are leading to high usage. Do this for say 10 minutes
when you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
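One way to have top stop by itself after the 10 minutes is to cap the
number of iterations (120 samples at the 5-second interval); -n is a
standard procps top flag, but the exact duration here is an assumption:

# batch mode, per-thread view, 5 s interval, 120 iterations = ~10 minutes
top -bHd 5 -n 120 > /tmp/top.${HOSTNAME}.txt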
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no log entries in the
glustershd.log files anymore. According to munin, disk latency (average
io wait) has gone down to 100 ms, and disk utilization has gone down to
~60% - both on all servers and hard disks.
But now the system load on 2 servers (which were in the good state)
fluctuates between 60 and 100; the server with the formerly failed disk
has a load of 20-30. I've uploaded some munin graphics of the cpu:
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers are under heavy load and one not
that much. Does anyone have an explanation of this strange behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished. Couldn't see/find any
related log message; is there such a message in a specific log file?
But i see the same behaviour as when the last heal finished: all CPU
cores are consumed by brick processes; not only by the formerly failed
bricksdd1, but by all 4 brick processes (and their threads). Load goes
up to > 100 on the 2 servers with the not-failed brick, and
glustershd.log gets filled with a lot of entries. Load on the server
with the then failed brick is not that high, but still ~60.
Is this behaviour normal? Is there some post-heal after a heal has
finished?
thx in advance :-)
--
Pranith