Discussion:
[Gluster-users] Gluster distributed replicated setup does not serve read from all bricks belonging to the same replica
Anh Vo
2018-11-22 04:50:32 UTC
Hi,
Our setup: We have a distributed replicated setup with replica 3. The total
number of servers varies between clusters; in some cases we have a total of
36 (12 x 3) servers, and in others we have 12 servers (4 x 3). We're
using gluster 3.12.15.

In all instances what I am noticing is that only one member of the replica
set serves reads for a particular file, even when all members of the
replica set are online. We have many large input files (for example, a
150GB zip file), and when 50 clients read that file from a single server,
read performance for that file degrades by several orders of magnitude.
Shouldn't all members of the replica set participate in serving the read
requests?

Our options

cluster.shd-max-threads: 1
cluster.heal-timeout: 900
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
cluster.data-self-heal: off
features.inode-quota: off
features.quota: off
transport.listen-backlog: 100
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
performance.strict-o-direct: on
network.remote-dio: off
server.allow-insecure: on
performance.write-behind: off
cluster.nufa: disable
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
cluster.ensure-durability: off
cluster.self-heal-window-size: 32
cluster.favorite-child-policy: mtime
performance.io-thread-count: 32
cluster.eager-lock: off
server.outstanding-rpc-limit: 128
cluster.rebal-throttle: aggressive
server.event-threads: 3
client.event-threads: 3
performance.cache-size: 6GB
cluster.readdir-optimize: on
storage.build-pgfid: on
Ravishankar N
2018-11-22 05:57:07 UTC
Hi,
If there are multiple clients, you can change the
'cluster.read-hash-mode' volume option's value to 2. Then different
reads should be served from different bricks for different clients. The
meaning of the various values of 'cluster.read-hash-mode' can be found in
the output of `gluster volume set help`. gluster-4.1 also adds a new
value[1] for this option. Of course, the assumption is that all bricks
host good copies (i.e. there are no self-heals pending).

Hope this helps,
Ravi

[1]  https://review.gluster.org/#/c/glusterfs/+/19698/
Anh Vo
2018-11-22 13:37:39 UTC
Thanks Ravi, I will try that option.
One question:
Let's say there are self-heals pending; how would the default of "0" have
worked? I understand 0 means "first responder". What if the first responder
doesn't have a good copy? (Say it failed in such a way that the dirty
attribute wasn't set on its copy, but there are index heals pending from
the other two sources.)
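
One way to see what each brick thinks of a given file is to inspect the AFR
changelog xattrs directly on the brick backends (the brick path below is a
placeholder, and VOLNAME stands for the volume name):

   # trusted.afr.VOLNAME-client-N holds the pending-heal counters blamed on
   # brick N, and trusted.afr.dirty is the "dirty" attribute mentioned above
   getfattr -d -m . -e hex /data/brick1/brick/path/to/file

   # list entries with heals pending, per brick
   gluster volume heal VOLNAME info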
Ravishankar N
2018-11-23 03:58:58 UTC
0 = first readable child of AFR, starting from the 1st child. So if the 1st
brick doesn't have the good copy, it will try the 2nd brick and so on.
The default value seems to be '1', not '0'. You can look at
afr_read_subvol_select_by_policy() in the source code to understand the
order of preference.
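
To confirm which policy is actually in effect on a volume (VOLNAME being a
placeholder), `gluster volume get` prints the current value, or the default
if the option was never set explicitly:

   gluster volume get VOLNAME cluster.read-hash-mode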

Regards,
Ravi
Anh Vo
2018-11-24 07:33:47 UTC
Looking at the source (afr-common.c), even when using hashed mode, if the
hashed brick doesn't have a good copy it will try the next brick; am I
correct? I'm curious because your first reply seemed to place some
significance on the part about pending self-heals. Is there anything about
pending self-heals that would have made hashed mode worse, or is it about
as bad as any other brick selection policy?

Thanks
Ravishankar N
2018-11-24 08:57:20 UTC
That is correct. No matter which brick the policy chooses, if that
brick is not readable for a given file (i.e. a heal is pending on it
from the other good bricks), we just iterate from brick-0 and pick the
first one that is good (i.e. readable).
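
A rough way to check whether reads really do get spread across the replicas
after changing the policy (VOLNAME again being a placeholder) is to look at
the per-brick fop statistics from volume profiling:

   # start io-stats collection (the diagnostics.* options in the original
   # post suggest it may already be enabled)
   gluster volume profile VOLNAME start

   # per-brick statistics; compare the READ call counts across the bricks
   # of one replica set
   gluster volume profile VOLNAME info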
-Ravi