Discussion: [Gluster-users] RDMA inline threshold?
Stefan Solbrig
2018-05-29 21:20:51 UTC
Dear all,

I faced a problem with a glusterfs volume (pure distributed, _not_ dispersed) over RDMA transport. One user had a directory with a large number of files (50,000 files), and just doing an "ls" in this directory yields a "Transport endpoint not connected" error. The effect is that "ls" shows only some of the files, but not all.

The respective log file shows this error message:

[2018-05-20 20:38:25.114978] W [MSGID: 114031] [client-rpc-fops.c:2578:client3_3_readdirp_cbk] 0-glurch-client-0: remote operation failed [Transport endpoint is not connected]
[2018-05-20 20:38:27.732796] W [MSGID: 103046] [rdma.c:4089:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (10.100.245.18:49153), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)
[2018-05-20 20:38:27.732844] W [MSGID: 114031] [client-rpc-fops.c:2578:client3_3_readdirp_cbk] 0-glurch-client-3: remote operation failed [Transport endpoint is not connected]
[2018-05-20 20:38:27.733181] W [fuse-bridge.c:2897:fuse_readdirp_cbk] 0-glusterfs-fuse: 72882828: READDIRP => -1 (Transport endpoint is not connected)

I already set the memlock limit for glusterd to unlimited, but the problem persists.
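(For reference, I raised the limit with a systemd drop-in, roughly like this; the drop-in file name is just what I chose:)

# /etc/systemd/system/glusterd.service.d/memlock.conf
[Service]
LimitMEMLOCK=infinity

# reload units and restart glusterd so the new limit takes effect
systemctl daemon-reload
systemctl restart glusterd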

Only going from RDMA transport to TCP transport solved the problem. (I'm running the volume now in mixed mode, config.transport=tcp,rdma.) Mounting with transport=rdma shows this error; mounting with transport=tcp is fine.
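(In case it helps others, the switch looked roughly like this; "myvol" and "server1" stand in for my actual names, and as far as I know the volume has to be stopped before changing the transport:)

gluster volume stop myvol
gluster volume set myvol config.transport tcp,rdma
gluster volume start myvol

# RDMA mount (shows the error) vs. TCP mount (works)
mount -t glusterfs -o transport=rdma server1:/myvol /mnt/myvol
mount -t glusterfs -o transport=tcp  server1:/myvol /mnt/myvol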

However, this problem arises only on some large directories, not on all of them. I haven't recognized a pattern yet.

I'm using glusterfs v3.12.6 on the servers, with QDR InfiniBand HCAs.

Is this a known issue with RDMA transport?

best wishes,
Stefan
Dan Lavu
2018-05-30 00:47:37 UTC
Stefan,

Sounds like a brick process is not running. I have noticed some strangeness
in my lab when using RDMA: I often have to forcibly restart the brick
process, practically every single time I do a major operation (adding a new
volume, removing a volume, stopping a volume, etc.).

gluster volume status <vol>

Do any of the self-heal daemons show N/A? If that's the case, try forcing
a restart of the volume.

gluster volume start <vol> force
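If several volumes are affected, I end up doing something like this (a rough sketch; it just force-starts every volume, which is heavy-handed but kicks the stuck daemons for me):

# force-start every volume the CLI knows about
for vol in $(gluster volume list); do
    gluster volume start "$vol" force
done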

This would also explain why your volumes aren't being replicated properly.
Dan Lavu
2018-05-30 01:00:26 UTC
Forgot to mention: sometimes I have to force-start other volumes as well;
it's hard to determine from the logs which brick process is locked up.


Status of volume: rhev_vms_primary
Gluster process                                                     TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick spidey.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary            0      49157  Y       15666
Brick deadpool.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary          0      49156  Y       2542
Brick groot.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary             0      49156  Y       2180
Self-heal Daemon on localhost                                            N/A        N/A  N       N/A   << Brick process is not running on any node.
Self-heal Daemon on spidey.ib.runlevelone.lan                            N/A        N/A  N       N/A
Self-heal Daemon on groot.ib.runlevelone.lan                             N/A        N/A  N       N/A

Task Status of Volume rhev_vms_primary
------------------------------------------------------------------------------
There are no active volume tasks


3081 gluster volume start rhev_vms_noshards force
3082 gluster volume status
3083 gluster volume start rhev_vms_primary force
3084 gluster volume status
3085 gluster volume start rhev_vms_primary rhev_vms
3086 gluster volume start rhev_vms_primary rhev_vms force

Status of volume: rhev_vms_primary
Gluster process                                                     TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick spidey.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary            0      49157  Y       15666
Brick deadpool.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary          0      49156  Y       2542
Brick groot.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary             0      49156  Y       2180
Self-heal Daemon on localhost                                            N/A        N/A  Y       8343
Self-heal Daemon on spidey.ib.runlevelone.lan                            N/A        N/A  Y       22381
Self-heal Daemon on groot.ib.runlevelone.lan                             N/A        N/A  Y       20633

Finally..

Dan
Stefan Solbrig
2018-05-30 10:34:01 UTC
Dear Dan,

thanks for the quick reply!

I actually tried restarting all processes (and even rebooting all servers), but the error persists. I can also confirm that all brick processes are running. My volume is a distribute-only volume (not dispersed, no sharding).

I also tried mounting with use_readdirp=no, because the error seems to be connected to readdirp, but this option did not change anything.
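(The mount line I tried looked roughly like this; "server1" and "myvol" stand in for my actual names:)

mount -t glusterfs -o transport=rdma,use_readdirp=no server1:/myvol /mnt/myvol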

I found two options I might try (gluster volume get myvolumename all | grep readdirp):

performance.force-readdirp    true
dht.force-readdirp            on

Can I turn these off safely? (Or what precisely do they do?)
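(I assume turning them off would look something like the following, but I haven't dared to yet because I don't know the side effects; "myvolumename" is as above:)

gluster volume set myvolumename performance.force-readdirp off
gluster volume set myvolumename dht.force-readdirp off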

I also made sure that all glusterd processes have unlimited locked memory.
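(I checked the running processes directly, roughly like this:)

# show the locked-memory limit of every running gluster process
for pid in $(pgrep -f gluster); do
    echo "== pid $pid =="
    grep "Max locked memory" /proc/$pid/limits
done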

Just to state it clearly: I do _not_ see any data corruption. Only the directory listings fail (in very rare cases) with RDMA transport:
"ls" shows only a part of the files.
But if I then run

stat /path/to/known/filename

it succeeds, and even

md5sum /path/to/known/filename/that/does/not/get/listed/with/ls

yields the correct result.

best wishes,
Stefan
Dan Lavu
2018-05-30 14:20:42 UTC
Stefan,

We'll have to let somebody else chime in. I don't work on this project;
I'm just another user and enthusiast who has spent (and is still spending)
much time tuning my own RDMA Gluster configuration. In short, I don't have
an answer for you. If nobody can answer, I'd suggest filing a bug so that
it can be tracked and reviewed by the developers.

- Dan