Discussion:
[Gluster-users] Gluster 3.12.12: performance during heal and in general
Hu Bert
2018-07-19 06:31:22 UTC
Hi there,

I sent this mail yesterday, but somehow it didn't work? It wasn't archived,
so please be indulgent if you receive this mail again :-)

We are currently running a replicate setup and are experiencing quite
poor performance. It got even worse when 2 bricks (disks) crashed
within a couple of weeks. Some general information about our setup:

3 Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB DDR4, OS on
separate disks); each server has 4 10TB disks -> each is a brick;
replica 3 setup (see gluster volume info below). Debian stretch,
kernel 4.9.0, gluster version 3.12.12. Servers and clients are
connected via 10 GBit ethernet.

About a month ago and again 2 days ago a disk died (on different servers);
the disks were replaced, brought back into the volume, and a full self
heal was started. But the speed of this is quite... disappointing. Each
brick has ~1.6TB of data on it (mostly the infamous small files). The
full heal I started yesterday copied only ~50GB within 24 hours (48
hours: about 100GB); at this rate it would take weeks for the self heal
to finish.
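For reference, a full heal like this is typically triggered and monitored
with the standard heal commands (shown here for our volume name 'shared'):

gluster volume heal shared full                     # trigger a full heal
gluster volume heal shared info                     # list entries still pending heal
gluster volume heal shared statistics heal-count    # per-brick count of pending entries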

After the first heal (started on gluster13 about a month ago, it took
about 3 weeks) finished, we had terrible performance; CPU on one or
two of the nodes (gluster11, gluster12) was up to 1200%, consumed by
the brick process of the formerly crashed brick (bricksdd1),
interestingly not on the server with the failed disk, but on the other
2...

Well... am I doing something wrong? Some options wrongly configured?
Terrible setup? Anyone got an idea? Any additional information needed?


Thx in advance :-)

gluster volume info

Volume Name: shared
Type: Distributed-Replicate
Volume ID: e879d208-1d8c-4089-85f3-ef1b3aa45d36
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: gluster11:/gluster/bricksda1/shared
Brick2: gluster12:/gluster/bricksda1/shared
Brick3: gluster13:/gluster/bricksda1/shared
Brick4: gluster11:/gluster/bricksdb1/shared
Brick5: gluster12:/gluster/bricksdb1/shared
Brick6: gluster13:/gluster/bricksdb1/shared
Brick7: gluster11:/gluster/bricksdc1/shared
Brick8: gluster12:/gluster/bricksdc1/shared
Brick9: gluster13:/gluster/bricksdc1/shared
Brick10: gluster11:/gluster/bricksdd1/shared
Brick11: gluster12:/gluster/bricksdd1_new/shared
Brick12: gluster13:/gluster/bricksdd1_new/shared
Options Reconfigured:
cluster.shd-max-threads: 4
performance.md-cache-timeout: 60
cluster.lookup-optimize: on
cluster.readdir-optimize: on
performance.cache-refresh-timeout: 4
performance.parallel-readdir: on
server.event-threads: 8
client.event-threads: 8
performance.cache-max-file-size: 128MB
performance.write-behind-window-size: 16MB
performance.io-thread-count: 64
cluster.min-free-disk: 1%
performance.cache-size: 24GB
nfs.disable: on
transport.address-family: inet
performance.high-prio-threads: 32
performance.normal-prio-threads: 32
performance.low-prio-threads: 32
performance.least-prio-threads: 8
performance.io-cache: on
server.allow-insecure: on
performance.strict-o-direct: off
transport.listen-backlog: 100
server.outstanding-rpc-limit: 128
Hu Bert
2018-07-20 07:41:35 UTC
Hmm... no one has any idea?

Additional question: the hdd on server gluster12 was replaced; so far
~220 GB have been copied. On the other 2 servers I see a lot of entries
in glustershd.log, about 312,000 and 336,000 entries respectively
yesterday; most of them (current log output) look like this:

[2018-07-20 07:30:49.757595] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
Completed data selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
sources=0 [2] sinks=1
[2018-07-20 07:30:49.992398] I [MSGID: 108026]
[afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
0-shared-replicate-3: performing metadata selfheal on
0d863a62-0dd8-401c-b699-2b642d9fd2b6
[2018-07-20 07:30:50.243551] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
Completed metadata selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
sources=0 [2] sinks=1

or like this:

[2018-07-20 07:38:41.726943] I [MSGID: 108026]
[afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
0-shared-replicate-3: performing metadata selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba
[2018-07-20 07:38:41.855737] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
Completed metadata selfheal on 9276097a-cdac-4d12-9dc6-04b1ea4458ba.
sources=[0] 2 sinks=1
[2018-07-20 07:38:44.755800] I [MSGID: 108026]
[afr-self-heal-entry.c:887:afr_selfheal_entry_do]
0-shared-replicate-3: performing entry selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba

Is this behaviour normal? I'd expect these messages on the server with
the failed brick, not on the other ones.
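As a rough check, I simply counted the entries in the log on each node
(default log location assumed):

grep -c 'Completed data selfheal' /var/log/glusterfs/glustershd.log       # finished data heals
grep -c 'Completed metadata selfheal' /var/log/glusterfs/glustershd.log   # finished metadata heals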
Hu Bert
2018-07-23 10:46:07 UTC
Well, over the weekend about 200GB were copied, so now ~400GB have been
copied to the new brick. That's nowhere near a speed of 10GB per hour.
If I copied the 1.6 TB directly, that would be done within 2 days at
most. But with the self heal this will take at least 20 days.

Why is the performance that bad? No chance of speeding this up?
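Rough arithmetic behind that estimate (assuming the rate stays roughly constant):

~400 GB copied in ~5 days           -> ~80 GB per day
remaining ~1.2 TB / 80 GB per day   -> ~15 more days, i.e. ~20 days in total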
Pranith Kumar Karampuri
2018-07-24 08:40:56 UTC
What kind of data do you have?
How many directories in the filesystem?
On average how many files per directory?
What is the depth of your directory hierarchy on average?
What is average filesize?

Based on this data we can see if anything can be improved, or if there
are some enhancements that need to be implemented in gluster to address
this kind of data layout.
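Something like the following, run against a client mount, should give
rough numbers (an illustrative sketch; <mnt> is a placeholder):

find <mnt> -mindepth 1 -type d | wc -l      # number of directories
find <mnt> -type f | wc -l                  # number of files
find <mnt> -type f -printf '%s\n' | awk '{ sum += $1; n++ } END { if (n) print sum/n }'   # average file size in bytes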
--
Pranith
Hu Bert
2018-07-26 05:10:09 UTC
Hi Pranith,

Sorry, it took a while to count the directories. I'll try to answer your
questions as well as possible.
Post by Pranith Kumar Karampuri
What kind of data do you have?
How many directories in the filesystem?
On average how many files per directory?
What is the depth of your directory hierarchy on average?
What is average filesize?
We have mostly images (more than 95% of disk usage, 90% of file
count), some text files (like css, jsp, gpx etc.) and some binaries.

There are about 190,000 directories in the file system; maybe there
are some more, because we're hit by bug 1512371 (parallel-readdir =
TRUE prevents directory listing). But the number of directories
could/will rise in the future (maybe to millions).

Files per directory: ranges from 0 to 100; on average it should be
about 20 files per directory (well, at least in the deepest dirs, see
explanation below).

Average file size: ranges from a few hundred bytes up to 30 MB; on
average it should be 2-3 MB.

Directory hierarchy: maximum depth as seen from within the volume is
6, the average should be 3.

volume name: shared
mount point on clients: /data/repository/shared/
below /shared/ there are 2 directories:
- public/: mainly calculated images (file sizes from a few KB up to
max 1 MB) and some resources (small PNGs with a size of a few hundred
bytes).
- private/: mainly source images; file sizes from 50 KB up to 30 MB

We migrated from an NFS server (SPOF) to glusterfs and simply copied
our files over. The images (which have an ID) are stored in the deepest
directories of the dir tree. Let me explain it a bit better :-)

Directory structure for the images (I'll omit some other miscellaneous
stuff, but it looks quite similar):
- ID of an image has 7 or 8 digits
- /shared/private/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID.jpg
- /shared/public/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID/$misc_formats.jpg
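A concrete example with a made-up image ID 1234567:

- /shared/private/123/456/1234567.jpg
- /shared/public/123/456/1234567/$misc_formats.jpg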

That's why we have so many (sub-)directories. Files are only stored
at the lowest level of the directory hierarchy. I hope that makes our
structure at least a bit more transparent.

I hope there's something we can do to raise performance a bit. Thx in
advance :-)
Pranith Kumar Karampuri
2018-07-26 06:56:22 UTC
Thanks a lot for the detailed write-up; this helps find the bottlenecks easily.
On a high level, to handle this directory hierarchy (i.e. lots of
directories with files) we need to improve the healing algorithms.
Based on the data you provided, we need to make the following
enhancements:

1) At the moment directories are healed one at a time, but files can be
healed up to 64 in parallel per replica subvolume.
So if you have nX2 or nX3 distributed subvolumes, it can heal 64n files
in parallel (with your 4 x 3 volume that would be up to 256 files in parallel).

I raised https://github.com/gluster/glusterfs/issues/477 to track this. In
the meanwhile you can use the following workaround:
a) Increase background heals on the mount:
gluster volume set <volname> cluster.background-self-heal-count 256
gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000
find <mnt> -type d | xargs stat

One 'find' will trigger heals for 10256 directories (256 + 10000), so
you may have to do this periodically until all directories are healed.
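A minimal sketch for re-running the crawl periodically (volume name and
mount path are placeholders; stop it once 'heal info' reports zero
pending entries on all bricks):

while true; do
    find <mnt> -type d | xargs stat > /dev/null                      # re-trigger heals for all directories
    gluster volume heal <volname> info | grep 'Number of entries:'   # show pending heal counts per brick
    sleep 3600
done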

2) Self-heal heals a file 128KB at a time (data-self-heal-window-size). I
think for your environment bumping it up to MBs is better, say 2MB, i.e.
16*128KB.

The command to do that is:
gluster volume set <volname> cluster.data-self-heal-window-size 16
--
Pranith
Hu Bert
2018-07-26 07:29:56 UTC
Hi Pranith,

Thanks a lot for your efforts and for tracking "my" problem with an issue. :-)

I've set these params on the gluster volume and will start the
'find...' command shortly. I'll probably send another mail to the
list to document the progress.

Btw. - you had some typos:
gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000 => "cluster" is doubled
gluster volume set <volname> cluster.data-self-heal-window-size 16 => it's actually cluster.self-heal-window-size

but actually no problem :-)

Just curious: would gluster 4.1 improve the performance for healing
and in general for "my" scenario?
Pranith Kumar Karampuri
2018-07-26 08:17:16 UTC
Post by Hu Bert
Hi Pranith,
thanks a lot for your efforts and for tracking "my" problem with an issue. :-)
I've set this params on the gluster volume and will start the
'find...' command within a short time. I'll probably add another
answer to the list to document the progress.
gluster volume set <volname> cluster.cluster.heal-wait-queue-length
10000 => cluster is doubled
gluster volume set <volname> cluster.data-self-heal-window-size 16 =>
it's actually cluster.self-heal-window-size
but actually no problem :-)
Sorry, bad copy/paste :-(.
Post by Hu Bert
Just curious: would gluster 4.1 improve the performance for healing
and in general for "my" scenario?
No, this issue is present in all existing releases, but it is solvable.
You can follow that issue to see the progress and when it is fixed.
--
Pranith
Hu Bert
2018-07-26 09:11:28 UTC
Post by Pranith Kumar Karampuri
Sorry, bad copy/paste :-(.
np :-)

The question regarding version 4.1 was meant more generally: does
gluster 4.x in general perform better than the 3.12 series?
Just curious :-) Sooner or later we'll have to upgrade anyway.

Btw.: gluster12 was the node with the failed brick, and I started the
full heal on this node (it has the biggest UUID as well). Is it normal
that the glustershd.log on this node is rather empty (a few hundred
entries), while the glustershd.log files on the 2 other nodes have
hundreds of thousands of entries?

(Sorry for the duplicate mail; the first one didn't go to the list, but
maybe others are interested... :-) )
Post by Pranith Kumar Karampuri
Post by Hu Bert
Hi Pranith,
thanks a lot for your efforts and for tracking "my" problem with an issue. :-)
I've set this params on the gluster volume and will start the
'find...' command within a short time. I'll probably add another
answer to the list to document the progress.
gluster volume set <volname> cluster.cluster.heal-wait-queue-length
10000 => cluster is doubled
gluster volume set <volname> cluster.data-self-heal-window-size 16 =>
it's actually cluster.self-heal-window-size
but actually no problem :-)
Sorry, bad copy/paste :-(.
Post by Hu Bert
Just curious: would gluster 4.1 improve the performance for healing
and in general for "my" scenario?
No, this issue is present in all the existing releases. But it is solvable.
You can follow that issue to see progress and when it is fixed etc.
Post by Hu Bert
Post by Pranith Kumar Karampuri
Thanks a lot for the detailed write-up, this helps find the bottlenecks easily.
On a high level, to handle this directory hierarchy, i.e. lots of directories
with files, we need to improve the healing algorithms. Based on the data you
provided, we need to make the following changes:
1) At the moment directories are healed one at a time, but files can be
healed up to 64 in parallel per replica subvolume.
So if you have nX2 or nX3 distributed subvolumes, it can heal 64n files
in parallel.
I raised https://github.com/gluster/glusterfs/issues/477 to track this. In
the meantime:
gluster volume set <volname> cluster.background-self-heal-count 256
gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000
find <mnt> -type d | xargs stat
One 'find' will trigger heals on 10256 directories. So you may have to do this
periodically until all directories are healed.
2) Self-heal heals a file 128KB at a time (data-self-heal-window-size). I
think for your environment bumping it up to MBs is better. Say 2MB, i.e.
16*128KB?
gluster volume set <volname> cluster.data-self-heal-window-size 16
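Concretely, that could look something like this (a rough sketch only; the
volume name 'shared' and the client mount /data/repository/shared are taken
from this setup, and the option names are used as corrected above):

gluster volume set shared cluster.background-self-heal-count 256
gluster volume set shared cluster.heal-wait-queue-length 10000
gluster volume set shared cluster.self-heal-window-size 16

# crawl all directories from a client mount; repeat this periodically
# until all directories are healed
find /data/repository/shared -type d | xargs -r stat > /dev/null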
Post by Hu Bert
Hi Pranith,
Sorry, it took a while to count the directories. I'll try to answer your
questions as well as possible.
Post by Pranith Kumar Karampuri
What kind of data do you have?
How many directories in the filesystem?
On average how many files per directory?
What is the depth of your directory hierarchy on average?
What is average filesize?
We have mostly images (more than 95% of disk usage, 90% of file
count), some text files (like css, jsp, gpx etc.) and some binaries.
There are about 190.000 directories in the file system; maybe there
are some more because we're hit by bug 1512371 (parallel-readdir =
TRUE prevents directory listing). But the number of directories
could/will rise in the future (maybe millions).
Files per directory: ranges from 0 to 100, on average it should be 20
files per directory (well, at least in the deepest dirs, see
explanation below).
Average filesize: ranges from a few hundred bytes up to 30 MB, on
average it should be 2-3 MB.
Directory hierarchy: maximum depth as seen from within the volume is
6, the average should be 3.
volume name: shared
mount point on clients: /data/repository/shared/
- public/: mainly calculated images (file sizes from a few KB up to
max 1 MB) and some resources (small PNGs with a size of a few hundred
bytes).
- private/: mainly source images; file sizes from 50 KB up to 30 MB
We migrated from an NFS server (SPOF) to glusterfs and simply copied
our files. The images (which have an ID) are stored in the deepest
directories of the dir tree. I'll better explain it :-)
directory structure for the images (I'll omit some other miscellaneous
directories):
- ID of an image has 7 or 8 digits
- /shared/private/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID.jpg
- /shared/public/: /(first 3 digits of ID)/(next 3 digits of
ID)/$ID/$misc_formats.jpg
That's why we have that many (sub-)directories. Files are only stored
in the lowest directory hierarchy. I hope I could make our structure
at least a bit more transparent.
i hope there's something we can do to raise performance a bit. thx in
advance :-)
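Just to illustrate that layout, a hypothetical little helper (the function
name and the example ID are made up, the paths follow the scheme described
above) would map an image ID to its two locations like this:

# print the private and public locations for a given image ID (7 or 8 digits)
path_for_id() {
    local id=$1
    local a=${id:0:3}   # first 3 digits of the ID
    local b=${id:3:3}   # next 3 digits of the ID
    echo "/shared/private/$a/$b/$id.jpg"
    echo "/shared/public/$a/$b/$id/"    # contains the various derived formats
}

path_for_id 12345678
# -> /shared/private/123/456/12345678.jpg
# -> /shared/public/123/456/12345678/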
2018-07-24 10:40 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Well, over the weekend about 200GB were copied, so now there are
~400GB copied to the brick. That's far beyond a speed of 10GB per
hour. If I copied the 1.6 TB directly, that would be done within max 2
days. But with the self heal this will take at least 20 days minimum.
Why is the performance that bad? No chance of speeding this up?
What kind of data do you have?
How many directories in the filesystem?
On average how many files per directory?
What is the depth of your directory hierarchy on average?
What is average filesize?
Based on this data we can see if anything can be improved. Or if there
are some enhancements that need to be implemented in gluster to address
this kind of data layout.
Post by Hu Bert
Post by Hu Bert
hmm... no one any idea?
Additional question: the hdd on server gluster12 was changed, so far
~220 GB were copied. On the other 2 servers I see a lot of entries in
glustershd.log, about 312.000 and 336.000 entries there yesterday,
most of them (current log output) looking like this:
[2018-07-20 07:30:49.757595] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal]
Completed data selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
sources=0 [2] sinks=1
[2018-07-20 07:30:49.992398] I [MSGID: 108026]
[afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
0-shared-replicate-3: performing metadata selfheal on
0d863a62-0dd8-401c-b699-2b642d9fd2b6
[2018-07-20 07:30:50.243551] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal]
Completed metadata selfheal on
0d863a62-0dd8-401c-b699-2b642d9fd2b6.
sources=0 [2] sinks=1
[2018-07-20 07:38:41.726943] I [MSGID: 108026]
[afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
0-shared-replicate-3: performing metadata selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba
[2018-07-20 07:38:41.855737] I [MSGID: 108026]
[afr-self-heal-common.c:1724:afr_log_selfheal]
Completed metadata selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba.
sources=[0] 2 sinks=1
[2018-07-20 07:38:44.755800] I [MSGID: 108026]
[afr-self-heal-entry.c:887:afr_selfheal_entry_do]
0-shared-replicate-3: performing entry selfheal on
9276097a-cdac-4d12-9dc6-04b1ea4458ba
is this behaviour normal? I'd expect these messages on the server with
the failed brick, not on the other ones.
Pranith Kumar Karampuri
2018-07-26 09:29:10 UTC
Permalink
Post by Hu Bert
Post by Pranith Kumar Karampuri
Sorry, bad copy/paste :-(.
np :-)
The question regarding version 4.1 was meant more generally: does
gluster v4.0 etc. have a better performance than version 3.12 etc.?
Just curious :-) Sooner or later we have to upgrade anyway.
You can check what changed @
https://github.com/gluster/glusterfs/blob/release-4.0/doc/release-notes/4.0.0.md#performance
https://github.com/gluster/glusterfs/blob/release-4.1/doc/release-notes/4.1.0.md#performance
Post by Hu Bert
btw.: gluster12 was the node with the failed brick, and i started the
full heal on this node (has the biggest uuid as well). Is it normal
that the glustershd.log on this node is rather empty (some hundred
entries), but the glustershd.log files on the 2 other nodes have
hundreds of thousands of entries?
heals happen on the good bricks, so this is expected.
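For example, counting the completed self-heals in glustershd.log on each
node should show that difference directly (assuming the default log
location under /var/log/glusterfs):

# run on gluster11/12/13; the nodes acting as heal sources will show
# by far the highest counts
grep -c 'Completed .* selfheal on' /var/log/glusterfs/glustershd.log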
--
Pranith
Hu Bert
2018-07-27 05:41:53 UTC
Permalink
Good Morning :-)

on server gluster11 about 1.25 million and on gluster13 about 1.35
million log entries in the glustershd.log file. About 70 GB got healed,
overall ~700GB of 2.0TB. It doesn't seem to run faster. I'm calling
'find...' whenever I notice that it has finished. Hmm... is it
possible and reasonable to run 2 finds in parallel, maybe on different
subdirectories? E.g. running one on $volume/public/ and one on
$volume/private/ ?
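Something like this (an untested sketch, using the client mount path from
this setup) would run the two crawls in parallel:

find /data/repository/shared/public  -type d | xargs -r stat > /dev/null &
find /data/repository/shared/private -type d | xargs -r stat > /dev/null &
wait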
Pranith Kumar Karampuri
2018-07-27 05:55:54 UTC
Permalink
Post by Hu Bert
Good Morning :-)
on server gluster11 about 1.25 million and on gluster13 about 1.35
million log entries in the glustershd.log file. About 70 GB got healed,
overall ~700GB of 2.0TB. It doesn't seem to run faster. I'm calling
'find...' whenever I notice that it has finished. Hmm... is it
possible and reasonable to run 2 finds in parallel, maybe on different
subdirectories? E.g. running one on $volume/public/ and one on
$volume/private/ ?
Do you already have all the 190000 directories created? If not,
could you find out which of the paths need it and do a stat directly
instead of find?
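For example (a sketch only; the brick paths are taken from the volume info
above, and it assumes ssh access from a machine that has a client mount at
/data/repository/shared, plus GNU find), the directories that exist on the
healthy brick but not yet on the replaced one could be stat'ed directly:

# compare the directory lists of the healthy and the replaced brick,
# then stat the missing ones through the client mount to trigger their heal
comm -23 \
  <(ssh gluster11 "find /gluster/bricksdd1/shared -type d -not -path '*/.glusterfs*' -printf '%P\n'" | sort) \
  <(ssh gluster12 "find /gluster/bricksdd1_new/shared -type d -not -path '*/.glusterfs*' -printf '%P\n'" | sort) |
while read -r dir; do
    stat "/data/repository/shared/$dir" > /dev/null
done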
--
Pranith
Hu Bert
2018-07-27 06:23:25 UTC
Permalink
Do you already have all the 190000 directories created? If not, could you find out which of the paths need it and do a stat directly instead of find?
Quite probably not all of them have been created (but counting how
many would take very long...). Hm, maybe running stat in a double loop
(thanks to our directory structure) would help. Something like this
(may not be 100% correct):

for a in {100..999}; do
  for b in {100..999}; do
    stat "$a/$b/"
  done
done

Should run stat on all directories. I think i'll give this a try.
Pranith Kumar Karampuri
2018-07-27 06:52:04 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Do you already have all the 190000 directories already created? If not
could you find out which of the paths need it and do a stat directly
instead of find?
Quite probable not all of them have been created (but counting how
much would take very long...). Hm, maybe running stat in a double loop
(thx to our directory structure) would help. Something like this (may
for a in ${100..999}; do
for b in ${100..999}; do
stat /$a/$b/
done
done
Should run stat on all directories. I think i'll give this a try.
Just to prevent these being served from a cache, it is probably better to
do this from a fresh mount?
--
Pranith
Hu Bert
2018-07-27 07:06:16 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Do you already have all the 190000 directories already created? If not
could you find out which of the paths need it and do a stat directly instead
of find?
Quite probable not all of them have been created (but counting how
much would take very long...). Hm, maybe running stat in a double loop
(thx to our directory structure) would help. Something like this (may
for a in ${100..999}; do
for b in ${100..999}; do
stat /$a/$b/
done
done
Should run stat on all directories. I think i'll give this a try.
Just to prevent these served from a cache, it is probably better to do this
from a fresh mount?
--
Pranith
Good idea. I'll install the glusterfs client on a little-used machine, so
there should be no caching. Thx! Have a good weekend when the time
comes :-)
Pranith Kumar Karampuri
2018-07-27 07:22:15 UTC
Permalink
If this proves effective, what you need to also do is unmount and mount
again, something like:

# /mnt/heal-crawl is just an example mount point for a fresh client mount
mount -t glusterfs gluster11:/shared /mnt/heal-crawl
cd /mnt/heal-crawl
for a in {100..999}; do
  for b in {100..999}; do
    stat "$a/$b/"    # prepend public/ or private/ as needed for the layout above
  done
done
cd / && umount /mnt/heal-crawl
--
Pranith
Hu Bert
2018-07-27 08:02:46 UTC
Permalink
I'll see what is possible over the weekend.

Btw.: I've seen in the munin stats that the disk utilization for
bricksdd1 on the healthy gluster servers is between 70% (night) and
almost 99% (daytime). So it looks like the basic problem is the
disk, which doesn't seem to be able to work any faster? If so, (heal)
performance won't improve with this setup, I assume. Maybe switching
to RAID10 (conventional hard disks), SSDs or even adding 3 additional
gluster servers (distributed replicated) could help?
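(As a quick check, iostat from the sysstat package shows the per-disk
utilization directly; the device name 'sdd' is only assumed here from the
brick name bricksdd1:)

# %util close to 100% means the disk itself is the bottleneck
iostat -dxm sdd 5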
Pranith Kumar Karampuri
2018-07-27 08:31:36 UTC
Permalink
Post by Hu Bert
I'll see what is possible over the weekend.
Btw.: i've seen in the munin stats that the disk utilization for
bricksdd1 on the healthy gluster servers is between 70% (night) and
almost 99% (daytime). So it looks like that the basic problem is the
disk which seems not to be able to work faster? If so (heal)
performance won't improve with this setup, i assume.
It could be saturating in the day. But if enough self-heals are going on,
even in the night
it should have been close to 100%.
Post by Hu Bert
Maybe switching
to RAID10 (conventional hard disks), SSDs or even add 3 additional
gluster servers (distributed replicated) could help?
It definitely will give better protection against hardware failure. The
failure domain will be smaller.
--
Pranith
Hu Bert
2018-07-27 08:47:44 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Hu Bert
Btw.: i've seen in the munin stats that the disk utilization for
bricksdd1 on the healthy gluster servers is between 70% (night) and
almost 99% (daytime). So it looks like that the basic problem is the
disk which seems not to be able to work faster? If so (heal)
performance won't improve with this setup, i assume.
It could be saturating in the day. But if enough self-heals are going on,
even in the night it should have been close to 100%.
Lowest utilization was 70% overnight, but I'll check this
evening/weekend. Also, that 'stat...' loop is running.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Maybe switching
to RAID10 (conventional hard disks), SSDs or even add 3 additional
gluster servers (distributed replicated) could help?
It definitely will give better protection against hardware failure. Failure
domain will be lesser.
What, in your opinion, would be better for performance?

- Having 3 servers and RAID10 (with conventional disks)
- Having 3 additional servers with 4 hdds (JBOD) each (distribute
replicate, replica 3)
- SSDs? (would be quite expensive to reach the storage amount we have
at the moment)

Just curious. It seems we'll have to adjust our setup during winter anyway :-)

Thanx again :-)
Hu Bert
2018-08-01 07:31:54 UTC
Permalink
Hello :-) Just wanted to give a short report...
Post by Hu Bert
Post by Pranith Kumar Karampuri
It could be saturating in the day. But if enough self-heals are going on,
even in the night it should have been close to 100%.
Lowest utilization was 70% overnight, but i'll check this
evening/weekend. Also that 'stat...' is running.
At the moment 1.1TB of 2.0TB got healed, disk utilization still
between 100% (day) and 70% (night). So this will take another 10-14
days.
Post by Hu Bert
What, in your opinion, would be better for performance?
- Having 3 servers and RAID10 (with conventional disks)
- Having 3 additional servers with 4 hdds (JBOD) each (distribute
replicate, replica 3)
- SSDs? (would be quite expensive to reach the storage amount we have
at the moment)
Just curious. It seems we'll have to adjust our setup during winter anyway :-)
Well, we'll definitely rethink our setup this autumn :-)
Hu Bert
2018-08-14 07:37:54 UTC
Permalink
Hi there,

well, it seems the heal has finally finished. Couldn't see/find any
related log message; is there such a message in a specific log file?

But i see the same behaviour as when the last heal finished: all CPU
cores are consumed by brick processes; not only by the formerly failed
bricksdd1, but by all 4 brick processes (and their threads). Load goes
up to > 100 on the 2 servers with the not-failed brick, and
glustershd.log gets filled with a lot of entries. Load on the server
with the then-failed brick is not that high, but still ~60.

Is this behaviour normal? Is there some post-heal after a heal has finished?

thx in advance :-)
Hu Bert
2018-08-15 09:07:37 UTC
Permalink
Hello again :-)

The self heal must have finished, as there are no log entries in the
glustershd.log files anymore. According to munin, disk latency (average
io wait) has gone down to 100 ms, and disk utilization has gone down
to ~60% - both on all servers and hard disks.

But now system load on 2 servers (which were in the good state)
fluctuates between 60 and 100; the server with the formerly failed
disk has a load of 20-30. I've uploaded some munin graphics of the cpu
usage:

https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png

This can't be normal. 2 of the servers under heavy load and one not
that much. Does anyone have an explanation of this strange behaviour?


Thx :-)
Hu Bert
2018-08-16 09:57:28 UTC
Permalink
Hi,

well, as the situation doesn't get better, we're quite helpless and
mostly in the dark, so we're thinking about hiring some professional
support. Any hint? :-)
Pranith Kumar Karampuri
2018-08-17 05:33:25 UTC
Permalink
Could you do the following on one of the nodes where you are observing high
CPU usage and attach that file to this thread? We can find what
threads/processes are leading to high usage. Do this for say 10 minutes
when you see the ~100% CPU.

top -bHd 5 > /tmp/top.${HOSTNAME}.txt
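If useful, the run can also be capped at roughly 10 minutes directly,
e.g. 120 iterations at the 5 second delay (the iteration count is just
an assumption):

top -bHd 5 -n 120 > /tmp/top.${HOSTNAME}.txt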
--
Pranith
Pranith Kumar Karampuri
2018-08-17 06:26:17 UTC
Permalink
As per the output, all io-threads are using a lot of CPU. It is better to
check the volume profile to see what is leading to so much work for the
io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section: "Running GlusterFS Volume Profile Command" and attach the output
of "gluster volume profile info".
Post by Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached the file.
Hopefully it helps.
--
Pranith
Pranith Kumar Karampuri
2018-08-17 06:31:08 UTC
Permalink
Please do volume profile also for around 10 minutes when CPU% is high.

--
Pranith
Hu Bert
2018-08-17 06:48:48 UTC
Permalink
i hope i did get it right.

gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop

If that's ok, i've attached the output of the info command.
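For reference, the same steps as a small one-shot script (just a sketch;
the 600 second sleep approximates the 10 minute wait and the output file
name is made up):

#!/bin/sh
# sketch: capture a ~10 minute profile window for volume "shared"
gluster volume profile shared start
sleep 600
gluster volume profile shared info > /tmp/profile.shared.$(date +%Y%m%d-%H%M).txt
gluster volume profile shared stop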
Pranith Kumar Karampuri
2018-08-17 07:30:26 UTC
Permalink
There seem to be too many lookup operations compared to any other
operation. What is the workload on the volume?
--
Pranith
Hu Bert
2018-08-17 08:19:03 UTC
Permalink
I don't know what exactly you mean by workload, but the main
function of the volume is storing (incl. writing and reading) images
(from hundreds of bytes up to 30 MB, ~7TB overall). The work is done
by apache tomcat servers writing to / reading from the volume. Besides
images there are some text files and binaries that are stored on the
volume and get updated regularly (every x hours); we'll try to migrate
the latter ones to local storage asap.

Interestingly it's only one process (and its threads) of the same
brick on 2 of the gluster servers that consumes the CPU.

gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU

Besides: performance during heal (e.g. gluster12, bricksdd1) was way
better than it is now. I've attached 2 pngs showing the differing cpu
usage of last week before/after heal.
Pranith Kumar Karampuri
2018-08-20 08:51:21 UTC
Permalink
There are a lot of Lookup operations in the system. But I am not able to
find why. Could you check the output of

# gluster volume heal <volname> info | grep -i number

it should print all zeros.
--
Pranith
Hu Bert
2018-08-20 09:10:48 UTC
Permalink
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0

Looks good to me.
Pranith Kumar Karampuri
2018-08-20 09:20:56 UTC
Permalink
Even the brick which doesn't have high CPU seems to have the same number of
lookups, so that's not it.
Is there any difference at all between the machines which have high CPU vs
low CPU?
I think the only other thing I would do is to install perf tools and try to
figure out the call-graph which is leading to so much CPU.

This affects performance of the brick I think, so you may have to do it
quickly and for a short time only.

perf record --call-graph=dwarf -p <brick-pid> -o </path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
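For example, a sketch of a concrete invocation (the pgrep pattern for
finding the bricksdd1 brick PID and the 60 second window are assumptions,
not something from this thread):

# sketch: find the bricksdd1 brick process and record it for ~60 seconds
BRICK_PID=$(pgrep -f 'glusterfsd.*bricksdd1' | head -n1)
perf record --call-graph=dwarf -p "$BRICK_PID" -o /tmp/perf.$(hostname).bricksdd1.out -- sleep 60
perf report -i /tmp/perf.$(hostname).bricksdd1.out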
--
Pranith
Hu Bert
2018-08-20 09:50:45 UTC
Permalink
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.

Ok, i ran perf for a few seconds.

------------------------
perf record --call-graph=dwarf -p 7897 -o /tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Warning:
Processed 83690 events and lost 96 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------

I copied a couple of lines:

+ 8.10%  0.00%  glusteriotwr22  [unknown]          [k] 0xffffffffffffffff
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] iterate_dir
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] sys_getdents
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] filldir
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] do_syscall_64
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_swapgs
+ 8.10%  0.00%  glusteriotwr22  libc-2.24.so       [.] 0xffff80c60db8ef2b
+ 8.10%  0.00%  glusteriotwr22  libc-2.24.so       [.] readdir64
+ 8.10%  0.00%  glusteriotwr22  index.so           [.] 0xffff80c6192a1888
+ 8.10%  0.04%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_htree_fill_tree
+ 8.10%  0.00%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_readdir
+ 7.95%  0.12%  glusteriotwr22  [kernel.kallsyms]  [k] htree_dirblock_to_tree
+ 5.78%  0.96%  glusteriotwr22  [kernel.kallsyms]  [k] __ext4_read_dirblock
+ 4.80%  0.02%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_bread
+ 4.78%  0.04%  glusteriotwr22  [kernel.kallsyms]  [k] ext4_getblk
+ 4.72%  0.02%  glusteriotwr22  [kernel.kallsyms]  [k] __getblk_gfp
+ 4.57%  0.00%  glusteriotwr3   [unknown]          [k] 0xffffffffffffffff
+ 4.55%  0.00%  glusteriotwr3   [kernel.kallsyms]  [k] do_syscall_64

Do you need different or additional information?
Pranith Kumar Karampuri
2018-08-21 04:43:43 UTC
Permalink
This looks like there are a lot of readdirs going on, which is different from
what we observed earlier. How many seconds did you do perf record for? Will
it be possible for you to do this for some more time, maybe a minute? Just
want to be sure that the data actually represents what we are observing.
--
Pranith
Pranith Kumar Karampuri
2018-08-21 05:13:05 UTC
Permalink
I found one code path which on lookup does readdirs. Could you give me the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all three
bricks? It can probably give a correlation to see if it is indeed the same
issue or not.
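For example, something along these lines could collect that from all
three nodes in one go (a sketch; it assumes passwordless ssh between the
nodes, and the brick path below is only a placeholder that has to be set
to the real bricksdd1 brick directory):

# sketch: gather the xattrop index listing from every node
BRICK=/path/to/bricksdd1/brick    # placeholder
for host in gluster11 gluster12 gluster13; do
    ssh "$host" "ls -l $BRICK/.glusterfs/indices/xattrop" > /tmp/xattrop.$host.txt
done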
--
Pranith
Hu Bert
2018-08-21 06:10:51 UTC
Permalink
Good morning :-)

gluster11:
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd

gluster12:
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82

gluster13:
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
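As an aside, a quick way to compare the three indices is to count everything
except the xattrop-<gfid> base file; this is only a sketch and assumes the
usual index layout where additional entries correspond to pending heals:

for d in /gluster/bricksdd1*/shared/.glusterfs/indices/xattrop; do
    echo "$d: $(ls "$d" | grep -cv '^xattrop-') entries besides the base file"
done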


And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)

fyi - load at the moment:
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50

perf record --call-graph=dwarf -p 7897 -o /tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Warning:
Processed 2137218 events and lost 33446 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]

Here's an excerpt.

+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir

Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
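As an aside, such a capture can be kept smaller by bounding the sampling
frequency and the duration; the -F 99 rate and the 60 seconds below are only
suggested values, not something requested in this thread:

timeout 60 perf record -F 99 --call-graph=dwarf -p 7897 -o /tmp/perf.gluster11.bricksdd1.out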
Pranith Kumar Karampuri
2018-08-21 06:17:15 UTC
Permalink
Post by Hu Bert
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! Yes, a link would be great. I am not that good with the kernel side
of things, so I will have to show this information to someone else who knows
them better; expect some delay in the response.
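One possible alternative to moving the 17 GB raw file around, assuming perf is
available on that node, is to render a plain-text report and compress that;
the file names below are just examples:

perf report --stdio -i /tmp/perf.gluster11.bricksdd1.out > /tmp/perf.gluster11.bricksdd1.report.txt
gzip /tmp/perf.gluster11.bricksdd1.report.txt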
--
Pranith
Hu Bert
2018-08-22 06:31:01 UTC
Permalink
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
the node with the lowest load i see in cli.log.1:

[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now

every 3 seconds. Looks like this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and the network is fine.

In cli.log there are only these entries:

[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0

Just wondered if this could be related anyhow.
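Two quick checks that might narrow this down; the log paths are the ones from
this thread, the rest is only a sketch. If the CLI really gets started every
few seconds, something on that node (a monitoring script, for example) is
probably invoking 'gluster' periodically:

# how often has the gluster CLI been started?
grep -c 'Started running gluster' /var/log/glusterfs/cli.log

# EPOLLERR messages per minute in the rotated cli log
grep EPOLLERR /var/log/glusterfs/cli.log.1 | awk '{print $1, substr($2, 1, 5)}' | uniq -c | tail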
Pranith Kumar Karampuri
2018-08-23 11:58:12 UTC
Permalink
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
the node with the lowest load i see in cli.log.1:
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
every 3 seconds. Looks like this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and the network is fine.
In cli.log there are only these entries:
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could be related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
filldir
Post by Pranith Kumar Karampuri
Post by Hu Bert
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side
of
Post by Pranith Kumar Karampuri
things. So I will have to show this information to someone else who knows
these things so expect delay in response.
Post by Hu Bert
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.] readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k] ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k] ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is different from
what we observed earlier, how many seconds did you do perf record
for?
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Will
it be possible for you to do this for some more time? may be a
minute?
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Just
want to be sure that the data actually represents what we are observing.
I found one code path which on lookup does readdirs. Could you give me the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the
three
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
bricks? It can probably give a correlation to see if it is indeed the same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same
number
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
of
lookups, so that's not it.
Is there any difference at all between the machines which have
high
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools
and
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have to
do
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am
not
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert <
Post by Hu Bert
I don't know what you exactly mean with workload, but the main
function of the volume is storing (incl. writing, reading) images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The work
is
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
done
by apache tomcat servers writing to / reading from the volume.
Besides
images there are some text files and binaries that are stored
on
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
the
volume and get updated regularly (every x hours); we'll try to
migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1)
was
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
way
better than it is now. I've attached 2 pngs showing the differing
cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seems to be too many lookup operations compared to any
other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info command.
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when
CPU%
is
high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of
CPU.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
It
is
better
to
check what the volume profile is to see what is leading
to
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
so
much
work
for
io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
section: "
Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached
the
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where
you
are
observing
high
CPU usage and attach that file to this thread? We can
find
what
threads/processes are leading to high usage. Do this
for
say
10
minutes
when
you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no log
entries
in
glustershd.log files anymore. According to munin
disk
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
latency
(average
io wait) has gone down to 100 ms, and disk
utilization
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
has
gone
down
to ~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in the
good
state)
fluctuates between 60 and 100; the server with the
formerly
failed
disk has a load of 20-30.I've uploaded some munin
graphics of
the
cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy
load
and
one
not
that much. Does anyone have an explanation of this
strange
behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished.
Couldn't
see/find
any
related log message; is there such a message in a
specific
log
file?
But i see the same behaviour when the last heal
all
CPU
cores are consumed by brick processes; not only by
the
formerly
failed
bricksdd1, but by all 4 brick processes (and their
threads).
Load
goes
up to > 100 on the 2 servers with the not-failed
brick,
and
glustershd.log gets filled with a lot of entries.
Load
on
the
server
with the then failed brick not that high, but
still
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Hu Bert
~60.
Is this behaviour normal? Is there some post-heal
after
a
heal
has
finished?
thx in advance :-)
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
Milind Changire
2018-08-27 13:41:34 UTC
Permalink
On Thu, Aug 23, 2018 at 5:28 PM, Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't all 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that shoud
have been fixed in the 3.12.x release, and network is fine.
That's odd. Presuming cli.log.1 is the logrotated file, it should be
showing older log entries than cli.log. But that's not the case here.
Or maybe there's something running on the command line on the node with
the lowest load.
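A quick way to check that (a minimal sketch, assuming the default /var/log/glusterfs location) is to compare the first entries of both files:
ls -l /var/log/glusterfs/cli.log /var/log/glusterfs/cli.log.1
head -n 1 /var/log/glusterfs/cli.log /var/log/glusterfs/cli.log.1
If rotation worked as expected, cli.log.1 should start with older timestamps than cli.log.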
Post by Pranith Kumar Karampuri
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
filldir
Post by Pranith Kumar Karampuri
Post by Hu Bert
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side
of
Post by Pranith Kumar Karampuri
things. So I will have to show this information to someone else who
knows
Post by Pranith Kumar Karampuri
these things so expect delay in response.
Post by Hu Bert
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650
v3
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.] readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is
different
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
from
what we observed earlier, how many seconds did you do perf record
for?
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Will
it be possible for you to do this for some more time? may be a
minute?
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Just
want to be sure that the data actually represents what we are observing.
I found one code path which on lookup does readdirs. Could you give
me
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the
three
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
bricks? It can probably give a correlation to see if it is indeed the same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same
number
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
of
lookups, so that's not it.
Is there any difference at all between the machines which have
high
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools
and
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have
to do
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am
not
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert <
Post by Hu Bert
I don't know what you exactly mean with workload, but the
main
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
function of the volume is storing (incl. writing, reading)
images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The
work is
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
done
by apache tomcat servers writing to / reading from the
volume.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Besides
images there are some text files and binaries that are
stored on
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
the
volume and get updated regularly (every x hours); we'll try
to
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the
same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1)
was
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
way
better than it is now. I've attached 2 pngs showing the
differing
cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seems to be too many lookup operations compared to
any
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info
command.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when
CPU%
is
high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of
CPU.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
It
is
better
to
check what the volume profile is to see what is
leading to
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
so
much
work
for
io-threads. Please follow the documentation at
https://gluster.readthedocs.
io/en/latest/Administrator%20Guide/Monitoring%20Workload/
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
section: "
Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached
the
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where
you
are
observing
high
CPU usage and attach that file to this thread? We
can
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
find
what
threads/processes are leading to high usage. Do this
for
say
10
minutes
when
you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no
log
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
entries
in
glustershd.log files anymore. According to munin
disk
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
latency
(average
io wait) has gone down to 100 ms, and disk
utilization
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
has
gone
down
to ~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in the
good
state)
fluctuates between 60 and 100; the server with the
formerly
failed
disk has a load of 20-30.I've uploaded some munin
graphics of
the
cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy
load
and
one
not
that much. Does anyone have an explanation of this
strange
behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished.
Couldn't
see/find
any
related log message; is there such a message in a
specific
log
file?
But i see the same behaviour when the last heal
all
CPU
cores are consumed by brick processes; not only
by
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Hu Bert
the
formerly
failed
bricksdd1, but by all 4 brick processes (and
their
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Hu Bert
threads).
Load
goes
up to > 100 on the 2 servers with the not-failed
brick,
and
glustershd.log gets filled with a lot of entries.
Load
on
the
server
with the then failed brick not that high, but
still
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Post by Hu Bert
~60.
Is this behaviour normal? Is there some post-heal
after
a
heal
has
finished?
thx in advance :-)
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Milind
Hu Bert
2018-08-27 14:14:58 UTC
Permalink
yeah, on debian xyz.log.1 is always the former logfile, which has been
rotated by logrotate. Just checked the 3 servers: now it looks good; i
will check it again tomorrow. Very strange, maybe logrotate hasn't
worked properly.
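If it shows up again, a dry run of the shipped logrotate config would at least show what logrotate thinks it should do, without touching any files (a sketch - the exact config filename under /etc/logrotate.d/ may differ on debian):
cat /etc/logrotate.d/glusterfs*
logrotate -d /etc/logrotate.d/glusterfs*
The -d switch only simulates the rotation and prints its decisions.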

the performance problems remain :-)
Post by Milind Changire
On Thu, Aug 23, 2018 at 5:28 PM, Pranith Kumar Karampuri
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't all 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that shoud
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
That's odd. Presuming cli.log.1 is the logrotated file, it should be showing
older log entries than cli.log. But its not the case here.
Or maybe, there's something running on the command-line on the node with the
lowest load.
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side of
things. So I will have to show this information to someone else who knows
these things so expect delay in response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is different
from
what we observed earlier, how many seconds did you do perf record for?
Will
it be possible for you to do this for some more time? may be a minute?
Just
want to be sure that the data actually represents what we are observing.
I found one code path which on lookup does readdirs. Could you give me
the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the three
bricks? It can probably give a correlation to see if it is indeed the
same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same number
of
lookups, so that's not it.
Is there any difference at all between the machines which have high
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools and
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have to do
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am
not
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert
Post by Hu Bert
I don't know what you exactly mean with workload, but the
main
function of the volume is storing (incl. writing, reading)
images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The
work is
done
by apache tomcat servers writing to / reading from the
volume.
Besides
images there are some text files and binaries that are
stored on
the
volume and get updated regularly (every x hours); we'll try
to
migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the
same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1)
was
way
better than it is now. I've attached 2 pngs showing the
differing
cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seems to be too many lookup operations compared to
any
other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info
command.
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes
when
CPU%
is
high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar
Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of
CPU.
It
is
better
to
check what the volume profile is to see what is
leading to
so
much
work
for
io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section: "
Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached
the
file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes
where
you
are
observing
high
CPU usage and attach that file to this thread? We
can
find
what
threads/processes are leading to high usage. Do
this
for
say
10
minutes
when
you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no
log
entries
in
glustershd.log files anymore. According to munin
disk
latency
(average
io wait) has gone down to 100 ms, and disk
utilization
has
gone
down
to ~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in
the
good
state)
fluctuates between 60 and 100; the server with the
formerly
failed
disk has a load of 20-30.I've uploaded some munin
graphics of
the
cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy
load
and
one
not
that much. Does anyone have an explanation of this
strange
behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished.
Couldn't
see/find
any
related log message; is there such a message in
a
specific
log
file?
But i see the same behaviour when the last heal
all
CPU
cores are consumed by brick processes; not only
by
the
formerly
failed
bricksdd1, but by all 4 brick processes (and
their
threads).
Load
goes
up to > 100 on the 2 servers with the not-failed
brick,
and
glustershd.log gets filled with a lot of
entries.
Load
on
the
server
with the then failed brick not that high, but
still
~60.
Is this behaviour normal? Is there some
post-heal
after
a
heal
has
finished?
thx in advance :-)
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Pranith
--
Milind
Hu Bert
2018-08-28 05:04:24 UTC
Permalink
Good Morning,

today i updated + rebooted all gluster servers (kernel update to
4.9.0-8 and gluster to 3.12.13). Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks came up at the
beginning but then lost its connection.

OK:
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0 Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0 Y 2136

Lost connection:

Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0 Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A N N/A

gluster volume heal shared info:
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
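(A quick check at that point - the match pattern is only a sketch - would be
pgrep -af 'bricksdd1_new'
on gluster13; it shows whether the brick's glusterfsd process is still around at all, or whether it has died rather than merely lost its connection.)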

reboot was at 06:15:39; the brick then worked for a short period, but
then somehow disconnected.

from gluster13:/var/log/glusterfs/glusterd.log:
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
[glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management:
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157

After 'gluster volume start shared force' (then with new port 49157):

Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0 Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0 Y 3994

from /var/log/syslog:

Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: pending frames:
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: frame :
type(0) op(0)
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: frame :
type(0) op(0)
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]:
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal
received: 11
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: time of crash:
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]:
2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]:
configuration details:
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]:
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------

There are some errors and warnings in shared.log (the volume logfile),
but no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
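Since syslog logged "signal received: 11" the brick process segfaulted, and glusterfs normally writes the corresponding backtrace into the brick's own logfile right after those lines. Something like this should dig it out (assuming the usual log naming under /var/log/glusterfs/bricks/ - the filename is a guess based on the brick path):
grep -A40 'signal received: 11' /var/log/glusterfs/bricks/gluster-bricksdd1_new-shared.log
That backtrace would be the interesting part for a bug report.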

Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users cause more traffic -> more work on
the servers); 'gluster volume heal shared info' shows no entries.
status:
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared 49152 0 Y 2482
Brick gluster12:/gluster/bricksda1/shared 49152 0 Y 2088
Brick gluster13:/gluster/bricksda1/shared 49152 0 Y 2115
Brick gluster11:/gluster/bricksdb1/shared 49153 0 Y 2489
Brick gluster12:/gluster/bricksdb1/shared 49153 0 Y 2094
Brick gluster13:/gluster/bricksdb1/shared 49153 0 Y 2116
Brick gluster11:/gluster/bricksdc1/shared 49154 0 Y 2497
Brick gluster12:/gluster/bricksdc1/shared 49154 0 Y 2095
Brick gluster13:/gluster/bricksdc1/shared 49154 0 Y 2127
Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0 Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0 Y 3994
Self-heal Daemon on localhost N/A N/A Y 4868
Self-heal Daemon on gluster12 N/A N/A Y 3813
Self-heal Daemon on gluster11 N/A N/A Y 5762

Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks

Very strange. Thanks for reading if you've reached this line :-)
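If the brick segfaults again, a core dump would give the developers something concrete to look at. Roughly (assuming core dumps are enabled and debug symbols, e.g. a glusterfs-dbg package, are installed; the core path is a placeholder):
gdb /usr/sbin/glusterfsd /path/to/core
and then "bt" at the (gdb) prompt prints the backtrace, which could be attached to a bug report.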
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't all 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that shoud
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side of
things. So I will have to show this information to someone else who knows
these things so expect delay in response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k] filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.] readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is different
from
what we observed earlier, how many seconds did you do perf record for?
Will
it be possible for you to do this for some more time? may be a minute?
Just
want to be sure that the data actually represents what we are observing.
I found one code path which on lookup does readdirs. Could you give me
the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the three
bricks? It can probably give a correlation to see if it is indeed the same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same number
of
lookups, so that's not it.
Is there any difference at all between the machines which have high
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools and
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have to do
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am not
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert
Post by Hu Bert
I don't know what you exactly mean with workload, but the main
function of the volume is storing (incl. writing, reading) images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The work is done
by apache tomcat servers writing to / reading from the volume. Besides
images there are some text files and binaries that are stored on the
volume and get updated regularly (every x hours); we'll try to migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1) was way
better than it is now. I've attached 2 pngs showing the differing cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seem to be too many lookup operations compared to any other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info
command.
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when CPU% is high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of CPU. It is better
to check what the volume profile is to see what is leading to so much
work for io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section: "Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached the file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where you are observing
high CPU usage and attach that file to this thread? We can find what
threads/processes are leading to high usage. Do this for say 10
minutes when you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no log entries in
glustershd.log files anymore. According to munin disk latency (average
io wait) has gone down to 100 ms, and disk utilization has gone down
to ~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in the good state)
fluctuates between 60 and 100; the server with the formerly failed
disk has a load of 20-30. I've uploaded some munin graphics of the cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy load and one not
that much. Does anyone have an explanation of this strange behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished. Couldn't see/find any
related log message; is there such a message in a specific log file?
But i see the same behaviour as when the last heal finished: all CPU
cores are consumed by brick processes; not only by the formerly failed
bricksdd1, but by all 4 brick processes (and their threads). Load goes
up to > 100 on the 2 servers with the not-failed brick, and
glustershd.log gets filled with a lot of entries. Load on the server
with the then failed brick is not that high, but still ~60.
Is this behaviour normal? Is there some post-heal after a heal has
finished?
thx in advance :-)
--
Pranith
Hu Bert
2018-08-28 06:54:11 UTC
Permalink
a little update after about 2 hours of uptime: still/again high cpu
usage by one brick process. server load >30.

gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd

The process for brick bricksdd1 consumes almost all 12 cores.
Interestingly there are more threads for the bricksdd1 process than
for the other bricks. Counted with "ps huH p <PID_OF_U_PROCESS> | wc
-l"

gluster11:
bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads,
bricksdd1 85 threads
gluster12:
bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 58 threads
gluster13:
bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 82 threads

Don't know if that could be relevant.
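(A minimal sketch of how these per-brick thread counts could be collected in
one loop; it assumes the brick path appears on the glusterfsd command line,
which is what pgrep -f matches against, and uses the brick names from this
setup:

  for b in bricksda1 bricksdb1 bricksdc1 bricksdd1 bricksdd1_new; do
      pid=$(pgrep -f "/gluster/$b/shared" | head -n1)
      [ -n "$pid" ] && echo "$b: $(ps huH p "$pid" | wc -l) threads"
  done

ps huH prints one line per thread and no header, so wc -l yields the thread
count directly.)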
Post by Hu Bert
Good Morning,
today i updated + rebooted all gluster servers, kernel update to
4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks did come up at the
beginning but then lost connection.
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0
Y 2136
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A
N N/A
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
reboot was at 06:15:39; brick then worked for a short period, but then
somehow disconnected.
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
type(0) op(0)
type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal
received: 11
2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------
There are some errors+warnings in the shared.log (volume logfile), but
no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
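(When a brick process dies with signal 11 like this, the full backtrace is
usually written to the brick's own log rather than the volume log. A minimal
sketch of where to look, assuming the default log location and the brick path
used in this setup:

  # brick logs are named after the brick path, with '/' replaced by '-'
  grep -A 30 "signal received: 11" \
      /var/log/glusterfs/bricks/gluster-bricksdd1_new-shared.log

The stack frames printed after the "signal received" line are what would be
needed to match the crash against a known bug.)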
Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users will cause more traffic -> more
work on servers), 'gluster volume heal shared info' shows no entries,
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared 49152 0 Y 2482
Brick gluster12:/gluster/bricksda1/shared 49152 0 Y 2088
Brick gluster13:/gluster/bricksda1/shared 49152 0 Y 2115
Brick gluster11:/gluster/bricksdb1/shared 49153 0 Y 2489
Brick gluster12:/gluster/bricksdb1/shared 49153 0 Y 2094
Brick gluster13:/gluster/bricksdb1/shared 49153 0 Y 2116
Brick gluster11:/gluster/bricksdc1/shared 49154 0 Y 2497
Brick gluster12:/gluster/bricksdc1/shared 49154 0 Y 2095
Brick gluster13:/gluster/bricksdc1/shared 49154 0 Y 2127
Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
Self-heal Daemon on localhost N/A N/A Y 4868
Self-heal Daemon on gluster12 N/A N/A Y 3813
Self-heal Daemon on gluster11 N/A N/A Y 5762
Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks
Very strange. Thanks for reading if you've reached this line :-)
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could be related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k] do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k] do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k] do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k] do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k] do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! Yes, a link would be great. I am not as good with the kernel
side of things, so I will have to show this information to someone else
who knows these things; expect a delay in the response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
[...]
Hu Bert
2018-08-28 07:24:37 UTC
Permalink
Hm, i noticed that in the shared.log (volume log file) on gluster11
and gluster12 (but not on gluster13) i now see these warnings:

[2018-08-28 07:18:57.224367] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3054593291
[2018-08-28 07:19:17.733625] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2595205890
[2018-08-28 07:19:27.950355] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3105728076
[2018-08-28 07:19:42.519010] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3740415196
[2018-08-28 07:19:48.194774] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2922795043
[2018-08-28 07:19:52.506135] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2841655539
[2018-08-28 07:19:55.466352] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3049465001

Don't know if that could be related.
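(These dht_layout_search warnings usually mean that a directory's hash layout
does not cover the full range, e.g. because the layout xattr is missing on one
subvolume. A minimal sketch of how that could be checked, assuming the attr
tools are installed; <some-directory> is a placeholder for a directory that
triggers the warning:

  # compare the layout xattr of the same directory on each brick
  getfattr -n trusted.glusterfs.dht -e hex \
      /gluster/bricksdd1/shared/<some-directory>

If the trusted.glusterfs.dht xattr is missing on some brick, a
'gluster volume rebalance shared fix-layout start' is the usual way to have
DHT recreate the layout; treat this as a hint to verify, not a confirmed
diagnosis.)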
Post by Hu Bert
[...]
Pranith
Hu Bert
2018-08-31 07:48:31 UTC
Permalink
Hi Pranith,

i just wanted to ask if you were able to get any feedback from your
colleagues :-)

btw.: we migrated some stuff (static resources, small files) to an nfs
server that we actually wanted to replace with glusterfs. Load and cpu
usage have gone down a bit, but are still asymmetric across the 3
gluster servers.
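(A minimal sketch of one way to compare the per-brick CPU share across the
three servers, assuming passwordless ssh and that the brick processes are the
glusterfsd instances; the hostnames are the ones from this setup:

  for h in gluster11 gluster12 gluster13; do
      echo "### $h"
      ssh "$h" 'ps -C glusterfsd -o pid,pcpu,nlwp,args --sort=-pcpu'
  done

pcpu is the accumulated CPU percentage and nlwp the thread count, so the
asymmetry between the servers should show up directly in this output.)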
Post by Hu Bert
Hm, i noticed that in the shared.log (volume log file) on gluster11
[2018-08-28 07:18:57.224367] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3054593291
[2018-08-28 07:19:17.733625] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2595205890
[2018-08-28 07:19:27.950355] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3105728076
[2018-08-28 07:19:42.519010] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3740415196
[2018-08-28 07:19:48.194774] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2922795043
[2018-08-28 07:19:52.506135] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2841655539
[2018-08-28 07:19:55.466352] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3049465001
Don't know if that could be related.
Post by Hu Bert
a little update after about 2 hours of uptime: still/again high cpu
usage by one brick processes. server load >30.
gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
The process for brick bricksdd1 consumes almost all 12 cores.
Interestingly there are more threads for the bricksdd1 process than
for the other bricks. Counted with "ps huH p <PID_OF_U_PROCESS> | wc
-l"
bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads,
bricksdd1 85 threads
bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 58 threads
bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 82 threads
Don't know if that could be relevant.
Post by Hu Bert
Good Morning,
today i update + rebooted all gluster servers, kernel update to
4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks did come up at the
beginning but then lost connection.
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0
Y 2136
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A
N N/A
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
reboot was at 06:15:39; brick then worked for a short period, but then
somehow disconnected.
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
type(0) op(0)
type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal
received: 11
2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------
There are some errors+warnings in the shared.log (volume logfile), but
no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users will cause more traffic -> more
work on servers), 'gluster volume heal shared info' shows no entries,
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared 49152 0 Y 2482
Brick gluster12:/gluster/bricksda1/shared 49152 0 Y 2088
Brick gluster13:/gluster/bricksda1/shared 49152 0 Y 2115
Brick gluster11:/gluster/bricksdb1/shared 49153 0 Y 2489
Brick gluster12:/gluster/bricksdb1/shared 49153 0 Y 2094
Brick gluster13:/gluster/bricksdb1/shared 49153 0 Y 2116
Brick gluster11:/gluster/bricksdc1/shared 49154 0 Y 2497
Brick gluster12:/gluster/bricksdc1/shared 49154 0 Y 2095
Brick gluster13:/gluster/bricksdc1/shared 49154 0 Y 2127
Brick gluster11:/gluster/bricksdd1/shared 49155 0 Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
Self-heal Daemon on localhost N/A N/A Y 4868
Self-heal Daemon on gluster12 N/A N/A Y 3813
Self-heal Daemon on gluster11 N/A N/A Y 5762
Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks
Very strange. Thanks for reading if you've reached this line :-)
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't all 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that shoud
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could related anyhow.
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k]
do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k]
do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k]
do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.] readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side of
things. So I will have to show this information to someone else who knows
these things so expect delay in response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown] [k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so [.]
readdir64
+ 8.10% 0.00% glusteriotwr22 index.so [.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms] [k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms] [k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms] [k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms] [k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms] [k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown] [k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms] [k]
do_syscall_64
Do you need different or additional information?
This looks like there are lot of readdirs going on which is different
from
what we observed earlier, how many seconds did you do perf record for?
Will
it be possible for you to do this for some more time? may be a minute?
Just
want to be sure that the data actually represents what we are
observing.
I found one code path which on lookup does readdirs. Could you give me
the
output of ls -l <brick-path>/.glusterfs/indices/xattrop on all the three
bricks? It can probably give a correlation to see if it is indeed the
same
issue or not.
Post by Pranith Kumar Karampuri
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have same
number
of
lookups, so that's not it.
Is there any difference at all between the machines which have
high
CPU
vs
low CPU?
I think the only other thing I would do is to install perf tools
and
try to
figure out the call-graph which is leading to so much CPU
This affects performance of the brick I think, so you may have to
do
it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o
</path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am
not
able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert
Post by Hu Bert
I don't know what you exactly mean with workload, but the
main
function of the volume is storing (incl. writing, reading)
images
(from hundreds of bytes up to 30 MBs, overall ~7TB). The work
is
done
by apache tomcat servers writing to / reading from the
volume.
Besides
images there are some text files and binaries that are stored
on
the
volume and get updated regularly (every x hours); we'll try
to
migrate
the latter ones to local storage asap.
Interestingly it's only one process (and its threads) of the
same
brick on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1)
was
way
better than it is now. I've attached 2 pngs showing the
differing
cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seems to be too many lookup operations compared to
any
other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info
command.
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when CPU% is high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of CPU. It is better
to check what the volume profile is to see what is leading to so much
work for io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section: "Running GlusterFS Volume Profile Command"
and attach output of "gluster volume profile info",
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached the file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where you are observing
high CPU usage and attach that file to this thread? We can find what
threads/processes are leading to high usage. Do this for say 10 minutes
when you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no log entries in
glustershd.log files anymore. According to munin disk latency (average
io wait) has gone down to 100 ms, and disk utilization has gone down to
~60% - both on all servers and hard disks.
But now system load on 2 servers (which were in the good state)
fluctuates between 60 and 100; the server with the formerly failed disk
has a load of 20-30. I've uploaded some munin graphics of the cpu
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers under heavy load and one not
that much. Does anyone have an explanation of this strange behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished. Couldn't see/find any
related log message; is there such a message in a specific log file?
But i see the same behaviour as when the last heal finished: all CPU
cores are consumed by brick processes; not only by the formerly failed
bricksdd1, but by all 4 brick processes (and their threads). Load goes
up to > 100 on the 2 servers with the not-failed brick, and
glustershd.log gets filled with a lot of entries. Load on the server
with the then failed brick is not that high, but still ~60.
Is this behaviour normal? Is there some post-heal after a heal has
finished?
thx in advance :-)
--
Pranith
Pranith Kumar Karampuri
2018-09-03 05:55:07 UTC
Permalink
Post by Hu Bert
Hi Pranith,
i just wanted to ask if you were able to get any feedback from your
colleagues :-)
Sorry, I didn't get a chance to. I am working on a customer issue which is
taking away cycles from any other work. Let me get back to you once I get
time this week.
Post by Hu Bert
btw.: we migrated some stuff (static resources, small files) to a nfs
server that we actually wanted to replace by glusterfs. Load and cpu
usage has gone down a bit, but still is asymmetric on the 3 gluster
servers.
Post by Hu Bert
Hm, i noticed that in the shared.log (volume log file) on gluster11
[2018-08-28 07:18:57.224367] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3054593291
[2018-08-28 07:19:17.733625] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2595205890
[2018-08-28 07:19:27.950355] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3105728076
[2018-08-28 07:19:42.519010] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3740415196
[2018-08-28 07:19:48.194774] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2922795043
[2018-08-28 07:19:52.506135] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2841655539
[2018-08-28 07:19:55.466352] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3049465001
Don't know if that could be related.
Post by Hu Bert
a little update after about 2 hours of uptime: still/again high cpu
usage by one brick process. server load >30.
gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
The process for brick bricksdd1 consumes almost all 12 cores.
Interestingly there are more threads for the bricksdd1 process than
for the other bricks. Counted with "ps huH p <PID_OF_U_PROCESS> | wc -l":
bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads,
bricksdd1 85 threads
bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 58 threads
bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 82 threads
Don't know if that could be relevant.
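(A small helper like the following could collect those numbers in one go
on each server - untested sketch, and it assumes the brick daemons are
glusterfsd processes that carry the brick path on their command line:)

# count threads per brick process on this server
for pid in $(pgrep -f glusterfsd); do
    brick=$(tr '\0' ' ' < /proc/$pid/cmdline | grep -o '/gluster/[^ ]*' | head -1)
    echo "$brick: $(ps huH p $pid | wc -l) threads"
done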
Post by Hu Bert
Good Morning,
today i updated + rebooted all gluster servers, kernel update to
4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks did come up at the
beginning but then lost connection.
Status of volume: shared
Gluster process TCP Port RDMA Port
Online Pid
------------------------------------------------------------------------------
Post by Hu Bert
Post by Hu Bert
Post by Hu Bert
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0
Y 2136
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A
N N/A
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
reboot was at 06:15:39; brick then worked for a short period, but then
somehow disconnected.
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: pending
type(0) op(0)
type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal received: 11
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: time of 2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------
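(Signal 11 means the brick process crashed with a segfault; the full
backtrace should be in the brick's own log file. A quick way to pull it
out - assuming the usual log location for this brick, so treat the path
as a guess:)

grep -A 40 'signal received: 11' \
    /var/log/glusterfs/bricks/gluster-bricksdd1_new-shared.log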
There are some errors+warnings in the shared.log (volume logfile), but
no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users will cause more traffic -> more
work on servers), 'gluster volume heal shared info' shows no entries,
Status of volume: shared
Gluster process                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared     49152     0          Y       2482
Brick gluster12:/gluster/bricksda1/shared     49152     0          Y       2088
Brick gluster13:/gluster/bricksda1/shared     49152     0          Y       2115
Brick gluster11:/gluster/bricksdb1/shared     49153     0          Y       2489
Brick gluster12:/gluster/bricksdb1/shared     49153     0          Y       2094
Brick gluster13:/gluster/bricksdb1/shared     49153     0          Y       2116
Brick gluster11:/gluster/bricksdc1/shared     49154     0          Y       2497
Brick gluster12:/gluster/bricksdc1/shared     49154     0          Y       2095
Brick gluster13:/gluster/bricksdc1/shared     49154     0          Y       2127
Brick gluster11:/gluster/bricksdd1/shared     49155     0          Y       2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155     0          Y       2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157     0          Y       3994
Self-heal Daemon on localhost                 N/A       N/A        Y       4868
Self-heal Daemon on gluster12                 N/A       N/A        Y       3813
Self-heal Daemon on gluster11                 N/A       N/A        Y       5762
Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks
Very strange. Thanks for reading if you've reached this line :-)
2018-08-23 13:58 GMT+02:00 Pranith Kumar Karampuri <
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Post by Hu Bert
Post by Hu Bert
Post by Hu Bert
Post by Hu Bert
Just wondered if this could be related anyhow.
2018-08-21 8:17 GMT+02:00 Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k]
do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k]
do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k]
do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k] filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_readdir
Or do you want to download the file /tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side
of things. So I will have to show this information to someone else who
knows these things so expect delay in response.
[...]
--
Pranith
Hu Bert
2018-09-19 12:04:53 UTC
Permalink
Hi Pranith,

i recently upgraded to version 3.12.14, still no change in
load/performance. Have you received any feedback?

At the moment i have 3 options:
- problem can be fixed within version 3.12
- upgrade to 4.1 and magically/hopefully "fix" the problem (might not
help when the problem is within the brick)
- replace glusterfs with $whatever (defeat... :-( )

thx
Hubert
Post by Pranith Kumar Karampuri
Post by Hu Bert
Hi Pranith,
i just wanted to ask if you were able to get any feedback from your
colleagues :-)
Sorry, I didn't get a chance to. I am working on a customer issue which is
taking away cycles from any other work. Let me get back to you once I get
time this week.
Post by Hu Bert
btw.: we migrated some stuff (static resources, small files) to a nfs
server that we actually wanted to replace by glusterfs. Load and cpu
usage has gone down a bit, but still is asymmetric on the 3 gluster
servers.
Post by Hu Bert
Hm, i noticed that in the shared.log (volume log file) on gluster11
[2018-08-28 07:18:57.224367] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3054593291
[2018-08-28 07:19:17.733625] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2595205890
[2018-08-28 07:19:27.950355] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3105728076
[2018-08-28 07:19:42.519010] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3740415196
[2018-08-28 07:19:48.194774] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2922795043
[2018-08-28 07:19:52.506135] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 2841655539
[2018-08-28 07:19:55.466352] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3049465001
Don't know if that could be related.
Post by Hu Bert
a little update after about 2 hours of uptime: still/again high cpu
usage by one brick process. server load >30.
gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
The process for brick bricksdd1 consumes almost all 12 cores.
Interestingly there are more threads for the bricksdd1 process than
for the other bricks. Counted with "ps huH p <PID_OF_U_PROCESS> | wc
-l"
bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads,
bricksdd1 85 threads
bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 58 threads
bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads,
bricksdd1_new 82 threads
Don't know if that could be relevant.
Post by Hu Bert
Good Morning,
today i updated + rebooted all gluster servers, kernel update to
4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks did come up at the
beginning but then lost connection.
Status of volume: shared
Gluster process TCP Port RDMA Port
Online Pid
------------------------------------------------------------------------------
[...]
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49155 0
Y 2136
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared N/A N/A
N N/A
Brick gluster13:/gluster/bricksdd1_new/shared
Status: Transport endpoint is not connected
Number of entries: -
reboot was at 06:15:39; brick then worked for a short period, but then
somehow disconnected.
[2018-08-28 04:27:36.944608] I [MSGID: 106005]
Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
glusterd.
[2018-08-28 04:28:57.869666] I
[glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
fresh brick process for brick /gluster/bricksdd1_new/shared
[2018-08-28 04:35:20.732666] I [MSGID: 106143]
[glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
/gluster/bricksdd1_new/shared on port 49157
Brick gluster11:/gluster/bricksdd1/shared 49155 0
Y 2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155 0
Y 2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157 0
Y 3994
type(0) op(0)
type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: signal
received: 11
2018-08-28 04:27:36
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: argp 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: backtrace 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: dlfcn 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: libpthread 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: llistxattr 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: setfsid 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: spinlock 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: epoll.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: xattr.h 1
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
Aug 28 06:27:36 gluster13 gluster-bricksdd1_new-shared[2136]: ---------
There are some errors+warnings in the shared.log (volume logfile), but
no error message telling me why
gluster13:/gluster/bricksdd1_new/shared has disconnected.
Well... at the moment load is ok, all 3 servers at about 15 (but i
think it will go up when more users will cause more traffic -> more
work on servers), 'gluster volume heal shared info' shows no entries,
Status of volume: shared
Gluster process                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster11:/gluster/bricksda1/shared     49152     0          Y       2482
Brick gluster12:/gluster/bricksda1/shared     49152     0          Y       2088
Brick gluster13:/gluster/bricksda1/shared     49152     0          Y       2115
Brick gluster11:/gluster/bricksdb1/shared     49153     0          Y       2489
Brick gluster12:/gluster/bricksdb1/shared     49153     0          Y       2094
Brick gluster13:/gluster/bricksdb1/shared     49153     0          Y       2116
Brick gluster11:/gluster/bricksdc1/shared     49154     0          Y       2497
Brick gluster12:/gluster/bricksdc1/shared     49154     0          Y       2095
Brick gluster13:/gluster/bricksdc1/shared     49154     0          Y       2127
Brick gluster11:/gluster/bricksdd1/shared     49155     0          Y       2506
Brick gluster12:/gluster/bricksdd1_new/shared 49155     0          Y       2097
Brick gluster13:/gluster/bricksdd1_new/shared 49157     0          Y       3994
Self-heal Daemon on localhost                 N/A       N/A        Y       4868
Self-heal Daemon on gluster12                 N/A       N/A        Y       3813
Self-heal Daemon on gluster11                 N/A       N/A        Y       5762
Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks
Very strange. Thanks for reading if you've reached this line :-)
2018-08-23 13:58 GMT+02:00 Pranith Kumar Karampuri
Post by Hu Bert
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:46.291327] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:20:49.291575] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
https://bugzilla.redhat.com/show_bug.cgi?id=1484885 - but that should
have been fixed in the 3.12.x release, and network is fine.
+Milind Changire
Post by Hu Bert
[2018-08-22 06:19:23.428520] I [cli.c:765:main] 0-cli: Started running
gluster with version 3.12.12
[2018-08-22 06:19:23.800895] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started
thread with index 1
[2018-08-22 06:19:23.800978] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22 06:19:23.809366] I [input.c:31:cli_batch] 0-: Exiting with: 0
Just wondered if this could be related anyhow.
2018-08-21 8:17 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Hu Bert
Good morning :-)
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Jul 17 11:24
xattrop-c7c6f765-ce17-4361-95fb-2fd7f31c7b82
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
---------- 1 root root 0 Aug 16 07:54
xattrop-16b696a0-4214-4999-b277-0917c76c983e
And here's the output of 'perf ...' which ran almost a minute - file
grew pretty fast to a size of 17 GB and system load went up heavily.
Had to wait a while until load dropped :-)
load gluster11: ~90
load gluster12: ~10
load gluster13: ~50
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
[ perf record: Woken up 9837 times to write data ]
Processed 2137218 events and lost 33446 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 16576.374 MB
/tmp/perf.gluster11.bricksdd1.out (2047760 samples) ]
Here's an excerpt.
+ 1.93% 0.00% glusteriotwr0 [unknown] [k]
0xffffffffffffffff
+ 1.89% 0.00% glusteriotwr28 [unknown] [k]
0xffffffffffffffff
+ 1.86% 0.00% glusteriotwr15 [unknown] [k]
0xffffffffffffffff
+ 1.85% 0.00% glusteriotwr63 [unknown] [k]
0xffffffffffffffff
+ 1.83% 0.01% glusteriotwr0 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr38 [unknown] [k]
0xffffffffffffffff
+ 1.82% 0.01% glusteriotwr28 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.82% 0.00% glusteriotwr0 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr28 [kernel.kallsyms] [k]
do_syscall_64
+ 1.81% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.81% 0.00% glusteriotwr36 [unknown] [k]
0xffffffffffffffff
+ 1.80% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
do_syscall_64
+ 1.78% 0.01% glusteriotwr63 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.77% 0.00% glusteriotwr63 [kernel.kallsyms] [k]
do_syscall_64
+ 1.75% 0.01% glusteriotwr38 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.75% 0.00% glusteriotwr38 [kernel.kallsyms] [k]
do_syscall_64
+ 1.74% 0.00% glusteriotwr17 [unknown] [k]
0xffffffffffffffff
+ 1.74% 0.00% glusteriotwr44 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr6 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.00% glusteriotwr37 [unknown] [k]
0xffffffffffffffff
+ 1.73% 0.01% glusteriotwr36 [kernel.kallsyms] [k]
entry_SYSCALL_64_after_swapgs
+ 1.72% 0.00% glusteriotwr34 [unknown] [k]
0xffffffffffffffff
+ 1.72% 0.00% glusteriotwr36 [kernel.kallsyms] [k]
do_syscall_64
+ 1.71% 0.00% glusteriotwr45 [unknown] [k]
0xffffffffffffffff
+ 1.70% 0.00% glusteriotwr7 [unknown] [k]
0xffffffffffffffff
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
sys_getdents
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
filldir
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
0xffff80c60db8ef2b
+ 1.68% 0.00% glusteriotwr15 libc-2.24.so [.]
readdir64
+ 1.68% 0.00% glusteriotwr15 index.so [.]
0xffff80c6192a1888
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
iterate_dir
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_htree_fill_tree
+ 1.68% 0.00% glusteriotwr15 [kernel.kallsyms] [k]
ext4_readdir
Or do you want to download the file
/tmp/perf.gluster11.bricksdd1.out
and examine it yourself? If so i could send you a link.
Thank you! yes a link would be great. I am not as good with kernel side
of things. So I will have to show this information to someone else who
knows these things so expect delay in response.
Post by Hu Bert
2018-08-21 7:13 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri
On Mon, Aug 20, 2018 at 3:20 PM Hu Bert
Post by Hu Bert
Regarding hardware the machines are identical. Intel Xeon
E5-1650
v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12
GBit/s
RAID Controller; operating system running on a raid1, then 4
disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
------------------------
perf record --call-graph=dwarf -p 7897 -o
/tmp/perf.gluster11.bricksdd1.out
^C[ perf record: Woken up 378 times to write data ]
Processed 83690 events and lost 96 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 423.087 MB
/tmp/perf.gluster11.bricksdd1.out (51744 samples) ]
------------------------
+ 8.10% 0.00% glusteriotwr22 [unknown]
[k]
0xffffffffffffffff
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
iterate_dir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
sys_getdents
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
filldir
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
do_syscall_64
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
entry_SYSCALL_64_after_swapgs
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so
[.]
0xffff80c60db8ef2b
+ 8.10% 0.00% glusteriotwr22 libc-2.24.so
[.]
readdir64
+ 8.10% 0.00% glusteriotwr22 index.so
[.]
0xffff80c6192a1888
+ 8.10% 0.04% glusteriotwr22 [kernel.kallsyms]
[k]
ext4_htree_fill_tree
+ 8.10% 0.00% glusteriotwr22 [kernel.kallsyms]
[k]
ext4_readdir
+ 7.95% 0.12% glusteriotwr22 [kernel.kallsyms]
[k]
htree_dirblock_to_tree
+ 5.78% 0.96% glusteriotwr22 [kernel.kallsyms]
[k]
__ext4_read_dirblock
+ 4.80% 0.02% glusteriotwr22 [kernel.kallsyms]
[k]
ext4_bread
+ 4.78% 0.04% glusteriotwr22 [kernel.kallsyms]
[k]
ext4_getblk
+ 4.72% 0.02% glusteriotwr22 [kernel.kallsyms]
[k]
__getblk_gfp
+ 4.57% 0.00% glusteriotwr3 [unknown]
[k]
0xffffffffffffffff
+ 4.55% 0.00% glusteriotwr3 [kernel.kallsyms]
[k]
do_syscall_64
Do you need different or additional information?
This looks like there are a lot of readdirs going on, which is different
from what we observed earlier. For how many seconds did you run perf
record? Would it be possible for you to do this for some more time,
maybe a minute? I just want to be sure that the data actually represents
what we are observing.
I found one code path which does readdirs on lookup. Could you give me
the output of ls -l <brick-path>/.glusterfs/indices/xattrop on all three
bricks? It can probably give a correlation to see whether it is indeed
the same issue or not.
Post by Hu Bert
2018-08-20 11:20 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Even the brick which doesn't have high CPU seems to have the same number
of lookups, so that's not it.
Is there any difference at all between the machines which have high CPU
vs low CPU?
I think the only other thing I would do is to install perf tools and try
to figure out the call-graph which is leading to so much CPU.
This affects performance of the brick I think, so you may have to do it
quickly and for less time.
perf record --call-graph=dwarf -p <brick-pid> -o </path/to/output>
then
perf report -i </path/to/output/given/in/the/previous/command>
On Mon, Aug 20, 2018 at 2:40 PM Hu Bert
Post by Hu Bert
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Looks good to me.
2018-08-20 10:51 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There are a lot of Lookup operations in the system. But I am not able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert
Post by Hu Bert
I don't know what you exactly mean with workload, but the main function
of the volume is storing (incl. writing, reading) images (from hundreds
of bytes up to 30 MBs, overall ~7TB). The work is done by apache tomcat
servers writing to / reading from the volume. Besides images there are
some text files and binaries that are stored on the volume and get
updated regularly (every x hours); we'll try to migrate the latter ones
to local storage asap.
Interestingly it's only one process (and its threads) of the same brick
on 2 of the gluster servers that consumes the CPU.
gluster11: bricksdd1; not healed; full CPU
gluster12: bricksdd1; got healed; normal CPU
gluster13: bricksdd1; got healed; full CPU
Besides: performance during heal (e.g. gluster12, bricksdd1) was way
better than it is now. I've attached 2 pngs showing the differing cpu
usage of last week before/after heal.
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
There seem to be too many lookup operations compared to any other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert
Post by Hu Bert
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info command.
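As a rough sketch, that same sequence could be scripted so the info
output lands in a file that can be attached (the file name and the
10-minute sleep are assumptions):

# profile the volume for ~10 minutes while CPU is high, save the result
gluster volume profile shared start
sleep 600
gluster volume profile shared info > /tmp/profile.shared.$(date +%F_%H%M).txt
gluster volume profile shared stop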
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Please do volume profile also for around 10 minutes when CPU% is high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
As per the output, all io-threads are using a lot of CPU. It is better
to check the volume profile to see what is leading to so much work for
the io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section "Running GlusterFS Volume Profile Command"
and attach the output of "gluster volume profile info".
On Fri, Aug 17, 2018 at 11:24 AM Hu Bert
Good morning,
i ran the command during 100% CPU usage and attached the file.
Hopefully it helps.
2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Could you do the following on one of the nodes where you are observing
high CPU usage and attach that file to this thread? We can find what
threads/processes are leading to high usage. Do this for say 10 minutes
when you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
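One way to have top stop by itself after the 10 minutes is to cap the
number of iterations (120 samples at the 5-second interval); -n is a
standard procps top flag, but the exact duration here is an assumption:

# batch mode, per-thread view, 5 s interval, 120 iterations = ~10 minutes
top -bHd 5 -n 120 > /tmp/top.${HOSTNAME}.txt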
On Wed, Aug 15, 2018 at 2:37 PM Hu Bert
Post by Hu Bert
Hello again :-)
The self heal must have finished as there are no log entries in the
glustershd.log files anymore. According to munin, disk latency (average
io wait) has gone down to 100 ms, and disk utilization has gone down to
~60% - both on all servers and hard disks.
But now the system load on 2 servers (which were in the good state)
fluctuates between 60 and 100; the server with the formerly failed disk
has a load of 20-30. I've uploaded some munin graphics of the cpu:
https://abload.de/img/gluster11_cpu31d3a.png
https://abload.de/img/gluster12_cpu8sem7.png
https://abload.de/img/gluster13_cpud7eni.png
This can't be normal. 2 of the servers are under heavy load and one not
that much. Does anyone have an explanation of this strange behaviour?
Thx :-)
2018-08-14 9:37 GMT+02:00 Hu Bert
Post by Hu Bert
Hi there,
well, it seems the heal has finally finished. Couldn't see/find any
related log message; is there such a message in a specific log file?
But i see the same behaviour as when the last heal finished: all CPU
cores are consumed by brick processes; not only by the formerly failed
bricksdd1, but by all 4 brick processes (and their threads). Load goes
up to > 100 on the 2 servers with the not-failed brick, and
glustershd.log gets filled with a lot of entries. Load on the server
with the then failed brick is not that high, but still ~60.
Is this behaviour normal? Is there some post-heal after a heal has
finished?
thx in advance :-)
--
Pranith