Discussion:
[Gluster-users] RDMA Client Hang Problem
Necati E. SISECI
2018-04-25 06:53:50 UTC
Dear Gluster-Users,

I am experiencing RDMA problems.

I have installed Ubuntu 16.04.4 with the 4.15.0-13-generic kernel and
MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64 on 4 different servers.
All of them have Mellanox ConnectX-4 LX dual-port NICs. These four
servers are connected via a Mellanox SN2100 switch.

I have installed GlusterFS Server v3.10 (from the Ubuntu PPA) on 3 servers.
These 3 boxes are running as a Gluster cluster. Additionally, I have
installed the GlusterFS client on the last one.

I created the Gluster volume with this command:

# gluster volume create db transport rdma replica 3 arbiter 1 \
    gluster1:/storage/db/ gluster2:/storage/db/ cinder:/storage/db force

(network.ping-timeout is 3)

Then I mounted this volume using the mount command below.

mount -t glusterfs -o transport=rdma gluster1:/db /db

After mounting "/db", I can access the files.

The problem is that when I reboot one of the cluster nodes, the fuse client
logs the error below and hangs.

[2018-04-17 07:42:55.506422] W [MSGID: 103070]
[rdma.c:4284:gf_rdma_handle_failed_send_completion]
0-rpc-transport/rdma: *send work request on `mlx5_0' returned error
wc.status = 5, wc.vendor_err = 245, post->buf = 0x7f8b92016000,
wc.byte_len = 0, post->reused = 135*
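
A note on reading the warning above: in libibverbs, the numeric wc.status follows the order of `enum ibv_wc_status` in <infiniband/verbs.h>, where 5 is IBV_WC_WR_FLUSH_ERR, the status reported for work requests that were flushed because the queue pair went into the error state. That fits a peer disappearing, but it does not by itself explain the hang. The sketch below decodes the status from a log line; the function names are hypothetical helpers, only a few codes are listed, and wc.vendor_err is vendor-specific and not decoded here.

```shell
#!/bin/sh
# Map a numeric wc.status to its libibverbs name.
# Values follow the order of `enum ibv_wc_status` in <infiniband/verbs.h>;
# only a subset is listed, everything else falls through to the default.
wc_status_name() {
    case "$1" in
        0)  echo IBV_WC_SUCCESS ;;
        1)  echo IBV_WC_LOC_LEN_ERR ;;
        2)  echo IBV_WC_LOC_QP_OP_ERR ;;
        3)  echo IBV_WC_LOC_EEC_OP_ERR ;;
        4)  echo IBV_WC_LOC_PROT_ERR ;;
        5)  echo IBV_WC_WR_FLUSH_ERR ;;
        12) echo IBV_WC_RETRY_EXC_ERR ;;
        13) echo IBV_WC_RNR_RETRY_EXC_ERR ;;
        *)  echo "unlisted status $1 (see verbs.h)" ;;
    esac
}

# Extract the number that follows "wc.status = " in a log line.
extract_wc_status() {
    printf '%s\n' "$1" | sed -n 's/.*wc\.status = \([0-9][0-9]*\).*/\1/p'
}

# Abbreviated copy of the warning from the client log.
sample="send work request on \`mlx5_0' returned error wc.status = 5, wc.vendor_err = 245"

wc_status_name "$(extract_wc_status "$sample")"   # prints IBV_WC_WR_FLUSH_ERR
```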

When I change the transport mode from rdma to tcp, the fuse client works
well, with no hangs.

I also tried Gluster 3.8, 3.10, 4.0.0 and 4.0.1 (from the Ubuntu PPAs) on
Ubuntu 16.04.4 and CentOS 7.4, but the results were the same.

Thank you.

Necati.
Raghavendra Gowdappa
2018-04-25 09:27:57 UTC
Is InfiniBand itself working fine? You can run tools like ibv_rc_pingpong
to find out.
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Necati E. SISECI
2018-04-25 12:35:11 UTC
Thank you for your mail.

ibv_rc_pingpong seems to work between the servers and the client. udaddy,
ucmatose, rping, etc. are also working.

***@gluster1:~# ibv_rc_pingpong -d mlx5_0 -g 0
  local address:  LID 0x0000, QPN 0x0001e4, PSN 0x10090e, GID fe80::ee0d:9aff:fec0:1dc8
  remote address: LID 0x0000, QPN 0x00014c, PSN 0x09402b, GID fe80::ee0d:9aff:fec0:1b14
8192000 bytes in 0.01 seconds = 7964.03 Mbit/sec
1000 iters in 0.01 seconds = 8.23 usec/iter

***@cinder:~# ibv_rc_pingpong -g 0 -d mlx5_0 gluster1
  local address:  LID 0x0000, QPN 0x00014c, PSN 0x09402b, GID fe80::ee0d:9aff:fec0:1b14
  remote address: LID 0x0000, QPN 0x0001e4, PSN 0x10090e, GID fe80::ee0d:9aff:fec0:1dc8
8192000 bytes in 0.01 seconds = 8424.73 Mbit/sec
1000 iters in 0.01 seconds = 7.78 usec/iter


Thank you.

Necati.
Raghavendra Gowdappa
2018-04-26 01:14:57 UTC
+Amar, +Rafi - Other maintainers and Peers of transport/rdma

* Can you attach logs from client and brick? Please set
diagnostics.client-log-level and diagnostics.brick-log-level to TRACE
before starting your tests.
* Does the fuse client recover from the hang?
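
For reference, the two requests above could be carried out roughly as follows. This is a minimal sketch, assuming the volume name db and mount point /db from earlier in the thread; set_log_level and mount_responds are hypothetical helper names, and the GLUSTER variable exists only so the commands can be dry-run with echo. The 10-second timeout is an arbitrary choice, and a process stuck in uninterruptible sleep may not be stopped by timeout at all.

```shell
#!/bin/sh
# Command used to talk to glusterd; override with GLUSTER=echo for a dry run.
GLUSTER=${GLUSTER:-gluster}

# Set the client and brick log levels of one volume (default level: TRACE).
set_log_level() {
    volume=$1
    level=${2:-TRACE}
    "$GLUSTER" volume set "$volume" diagnostics.client-log-level "$level" &&
    "$GLUSTER" volume set "$volume" diagnostics.brick-log-level "$level"
}

# Return 0 if a stat of the mount point completes within the timeout (seconds).
mount_responds() {
    timeout "${2:-10}" stat "$1" >/dev/null 2>&1
}

# Example usage for the setup described in this thread:
#   set_log_level db TRACE
#   ... reboot one brick node and reproduce the hang ...
#   mount_responds /db 10 && echo "mount responds" || echo "mount still hung"
#   set_log_level db INFO    # assumed normal level; adjust to your own setting
```

TRACE output is very verbose, so it is worth lowering the level again once the logs have been collected.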

I think we might not be handling the poll_err path correctly. The fact that
we see issues only after a brick reboots makes me suspect the error path.

regards,
Raghavendra