Discussion:
gluster connection interrupted during transfer
Richard Neuboeck
2018-08-29 12:41:08 UTC
Hi Gluster Community,

I have a problem with glusterfs: file transfers abort with a
'Transport endpoint not connected' error. I can reproduce the abort
(every time now) but cannot pinpoint why it is happening.

The volume is set up in replica 3 mode and accessed with the fuse
gluster client. Both client and server are running CentOS and the
supplied 3.12.11 version of gluster.

The connection abort happens at different times during rsync but
occurs every time I try to sync all our files (1.1TB) to the empty
volume.

I don't find errors in the gluster log files on either the client or
the server side. rsync logs the obvious transfer problem. The only log
that shows anything related is the server brick log, which states that
the connection is shutting down:

[2018-08-18 22:40:35.502510] I [MSGID: 115036]
[server.c:527:server_rpc_notify] 0-home-server: disconnecting
connection from brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
[2018-08-18 22:40:35.502620] W
[inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock
on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
{client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}
[2018-08-18 22:40:35.502692] W
[entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
{client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
[2018-08-18 22:40:35.502719] W
[entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
{client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
[2018-08-18 22:40:35.505950] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-home-server: Shutting down
connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
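A minimal sketch of how such brick log entries could be parsed and summarized, so that rare events like the disconnect stand out in a large log. It assumes each entry sits on a single line in the actual log file (the excerpt above is wrapped by the mail client); the regular expression and the helper names `parse_entry` and `summarize` are illustrative, not part of gluster.

```python
import re
from collections import Counter

# Assumed entry layout, one entry per line:
# [timestamp] LEVEL [MSGID: n] [file:line:function] domain: message
# The MSGID part is optional, as in the W entries of the excerpt.
ENTRY_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<origin>[^\]]+)\] "
    r"(?P<domain>\S+): "
    r"(?P<message>.*)$"
)


def parse_entry(line):
    """Return the fields of one log entry as a dict, or None if the line does not match."""
    m = ENTRY_RE.match(line.strip())
    return m.groupdict() if m else None


def summarize(lines):
    """Count entries per (level, function); non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry:
            function = entry["origin"].split(":")[-1]
            counts[(entry["level"], function)] += 1
    return counts


# Two entries from the excerpt, rejoined onto single lines.
sample = [
    "[2018-08-18 22:40:35.502510] I [MSGID: 115036] "
    "[server.c:527:server_rpc_notify] 0-home-server: disconnecting "
    "connection from brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0",
    "[2018-08-18 22:40:35.502620] W "
    "[inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock "
    "on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by "
    "{client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}",
]

for key, count in summarize(sample).items():
    print(key, count)
```

Passing an open file object instead of `sample` keeps memory use flat, since the lines are consumed one at a time.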

Since I have been running another replica 3 setup for oVirt for a long
time now, and it is completely stable, I first thought I had made a
mistake by setting different options. However, even after resetting
those options I'm able to reproduce the connection problem.

The unoptimized volume setup looks like this:

Volume Name: home
Type: Replicate
Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sphere-four:/srv/gluster_home/brick
Brick2: sphere-five:/srv/gluster_home/brick
Brick3: sphere-six:/srv/gluster_home/brick
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.server-quorum-ratio: 50%


The following additional options were used before:

performance.cache-size: 5GB
client.event-threads: 4
server.event-threads: 4
cluster.lookup-optimize: on
features.cache-invalidation: on
performance.stat-prefetch: on
performance.cache-invalidation: on
network.inode-lru-limit: 50000
features.cache-invalidation-timeout: 600
performance.md-cache-timeout: 600
performance.parallel-readdir: on
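To see exactly which options differ between two configurations (for instance the tuned option list versus the unoptimized volume), the lists can be compared mechanically. A minimal sketch, assuming the `key: value` lines are copied from the "Options Reconfigured" section of the `gluster volume info` output; the helper names `parse_options` and `diff_options` are made up for illustration.

```python
def parse_options(text):
    """Turn 'key: value' lines into a dict, skipping blank lines and lines without a value."""
    options = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            options[key.strip()] = value.strip()
    return options


def diff_options(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys that differ; None marks a missing key."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Values taken from the post; only a few of the tuned options are shown here.
baseline = parse_options("""
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
""")
tuned = parse_options("""
nfs.disable: on
transport.address-family: inet
cluster.quorum-type: auto
performance.cache-size: 5GB
client.event-threads: 4
""")

for key, (before, after) in diff_options(baseline, tuned).items():
    print(f"{key}: {before} -> {after}")
```

Only the option lines should be fed in: a line such as `Brick1: sphere-four:/srv/gluster_home/brick` would otherwise be treated as an option too.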


In this case the gluster servers and also the client are using a
bonded network device running in adaptive load balancing mode.

I've tried using the debug option for the client mount, but apart from
a ~0.5TB log file I didn't get any information that seems helpful to me.

Transferring just a couple of GB works without problems.

It may very well be that I'm already blind to the obvious but after
many long running tests I can't find the crux in the setup.

Does anyone have an idea how to approach this problem in a way
that yields some useful information?

Any help is highly appreciated!
Cheers
Richard
--
/dev/null
Nithya Balachandran
2018-08-30 07:45:07 UTC
Hi Richard,
Post by Richard Neuboeck
Since I'm running another replica 3 setup for oVirt for a long time
Is this setup running with the same gluster version and on the same nodes
or is it a different cluster?
_______________________________________________
Gluster-users mailing list
https://lists.gluster.org/mailman/listinfo/gluster-users
Richard Neuboeck
2018-08-30 11:48:39 UTC
Hi Nithya,
Post by Nithya Balachandran
Since I'm running another replica 3 setup for oVirt for a long time
Is this setup running with the same gluster version and on the same
nodes or is it a different cluster?
It's a different cluster (sphere-one, sphere-two and sphere-three)
but the same gluster version and basically the same hardware.

Cheers
Richard
--
/dev/null
Raghavendra Gowdappa
2018-08-30 12:40:34 UTC
Normally the client logs give a clue as to why the disconnections are
happening (ping-timeout, wrong port, etc.). Can you look into the client logs to
figure out what's happening? If you can't find anything, can you send
the client logs across?
Richard Neuboeck
2018-08-30 13:34:31 UTC
Hi,

I'm attaching a shortened version of the client mount log, since the
whole log is about 5.8GB. It includes the initial mount messages and
the last two minutes of log entries.

It ends very anticlimactically, without an obvious error. Is there
anything specific I should be looking for?
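A minimal sketch of one way to cut such an excerpt (the first lines plus the last two minutes) out of a multi-GB log in a single streaming pass. It assumes entries start with a `[YYYY-MM-DD HH:MM:SS.ffffff]` timestamp and appear in chronological order; the function name `head_and_tail`, the default of 200 head lines and the sample entries are illustrative assumptions.

```python
import re
from collections import deque
from datetime import datetime, timedelta

TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\]")


def head_and_tail(lines, head_count=200, window=timedelta(minutes=2)):
    """Return the first head_count lines and all later lines that fall within
    `window` of the last timestamp seen. Memory use is bounded by the window."""
    head = []
    tail = deque()  # (timestamp, line) pairs still inside the window
    last_ts = None
    for line in lines:
        if len(head) < head_count:
            head.append(line)
            continue
        m = TS_RE.match(line)
        if m:
            last_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if last_ts is None:
            continue  # no timestamp seen after the head yet, nothing to anchor this line to
        # Lines without their own timestamp inherit the previous entry's timestamp.
        tail.append((last_ts, line))
        while tail and last_ts - tail[0][0] > window:
            tail.popleft()
    return head, [line for _, line in tail]


# Made-up entries, only to show the behaviour.
log = [
    "[2018-08-16 08:36:28.575972] I mount start",
    "[2018-08-18 22:30:00.000000] D old entry",
    "[2018-08-18 22:39:00.000000] D recent entry",
    "continuation line",
    "[2018-08-18 22:40:35.502510] I last entry",
]
head, tail = head_and_tail(log, head_count=1)
print(len(head), len(tail))
```

For a real log the `lines` argument would be an open file object, so the file is never loaded into memory as a whole.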

Cheers
Richard
--
/dev/null
Raghavendra Gowdappa
2018-08-31 01:50:25 UTC
+Mohit. +Milind

@Mohit/Milind,

Can you check logs and see whether you can find anything relevant?
Post by Richard Neuboeck
Hi,
I'm attaching a shortened version since the whole is about 5.8GB of
the client mount log. It includes the initial mount messages and the
last two minutes of log entries.
It ends very anticlimactic without an obvious error. Is there
anything specific I should be looking for?
Normally I look at the logs around the disconnect msgs to find out the reason. But as
you said, sometimes one sees just the disconnect msgs without any reason.
That normally points to a cause in the network rather than a
Glusterfs-initiated disconnect.
Richard Neuboeck
2018-08-31 05:41:37 UTC
Post by Raghavendra Gowdappa
+Mohit. +Milind
@Mohit/Milind,
Can you check logs and see whether you can find anything relevant?
From a glance at the system logs, nothing out of the ordinary
occurred. However I'll start another rsync and take a closer look.
It will take a few days.
Post by Raghavendra Gowdappa
On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
Hi,
I'm attaching a shortened version since the whole is about 5.8GB of
the client mount log. It includes the initial mount messages and the
last two minutes of log entries.
It ends very anticlimactic without an obvious error. Is there
anything specific I should be looking for?
Normally I look logs around disconnect msgs to find out the reason.
But as you said, sometimes one can see just disconnect msgs without
any reason. That normally points to reason for disconnect in the
network rather than a Glusterfs initiated disconnect.
The rsync source is currently serving our homes, so there are NFS
connections 24/7. There don't seem to be any network-related
interruptions - a co-worker would be here faster than I could check
the logs if the connection to home were broken ;-)
Because of this problem the three gluster machines are used only for
testing, so there is nothing else running on them.
--
/dev/null
Raghavendra Gowdappa
2018-08-31 06:13:59 UTC
Post by Richard Neuboeck
Post by Raghavendra Gowdappa
+Mohit. +Milind
@Mohit/Milind,
Can you check logs and see whether you can find anything relevant?
From glances at the system logs nothing out of the ordinary
occurred. However I'll start another rsync and take a closer look.
It will take a few days.
Post by Raghavendra Gowdappa
On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
Hi,
I'm attaching a shortened version since the whole is about 5.8GB of
the client mount log. It includes the initial mount messages and the
last two minutes of log entries.
It ends very anticlimactic without an obvious error. Is there
anything specific I should be looking for?
Normally I look logs around disconnect msgs to find out the reason.
But as you said, sometimes one can see just disconnect msgs without
any reason. That normally points to reason for disconnect in the
network rather than a Glusterfs initiated disconnect.
The rsync source is serving our homes currently so there are NFS
connections 24/7. There don't seem to be any network related
interruptions
Can you set diagnostics.client-log-level and diagnostics.brick-log-level to
TRACE and check the logs on both ends of the connection - client and brick? To
reduce the log size, I would suggest rotating the existing logs and starting
with fresh logs just before you begin, so that only relevant entries are
captured. Also, can you take an strace of the client and brick processes using:

strace -o <outputfile> -ff -v -p <pid>

Please attach both the logs and the strace output. Let's trace through what the
syscalls on the socket return and then decide whether to inspect a tcpdump or
not. If you don't want to repeat the tests again, please capture a tcpdump too
(on both ends of the connection) and send it to us.
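The steps above can be collected into a small dry-run helper that only prints the commands, so they can be reviewed before anything is run. This is a sketch under assumptions: the volume name `home` comes from the volume info earlier in the thread, while the PID `110423` and the output prefix `/tmp/gluster-strace` are placeholders to be replaced with the real client and brick process IDs and a suitable path. The function name `trace_commands` is made up for illustration.

```python
import shlex


def trace_commands(volume, pids, out_prefix="/tmp/gluster-strace"):
    """Build the log-level and strace command lines as argument lists (nothing is executed)."""
    cmds = [
        ["gluster", "volume", "set", volume, "diagnostics.client-log-level", "TRACE"],
        ["gluster", "volume", "set", volume, "diagnostics.brick-log-level", "TRACE"],
    ]
    for pid in pids:
        # With -ff, strace writes one output file per traced process, suffixed with its PID.
        cmds.append(["strace", "-o", f"{out_prefix}.{pid}", "-ff", "-v", "-p", str(pid)])
    return cmds


# Placeholder PID; use the actual glusterfs client and brick PIDs.
for cmd in trace_commands("home", [110423]):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists means they could later be handed to `subprocess.run` unchanged, but printing them first avoids switching a production volume to TRACE by accident.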


Post by Richard Neuboeck
- a co-worker would be here faster than I could check
the logs if the connection to home would be broken ;-)
The three gluster machines are due to this problem reduced to only
testing so there is nothing else running.
Richard Neuboeck
2018-09-11 08:10:08 UTC
Permalink
Hi,

Since I feared that the logs would fill up the partition (again), I
checked the systems daily and finally found the reason. The glusterfs
process on the client runs out of memory and gets killed by the OOM killer
after about four days. Since rsync keeps running for a couple of days longer
until it ends, I never checked the whole time frame in the system logs and
so never stumbled upon the OOM message.
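
For anyone who wants to check their own system logs for the same symptom, here is a minimal Python sketch that scans kernel log lines for OOM kills. It assumes the common "Killed process <pid> (<name>) total-vm:...kB, anon-rss:...kB" wording of the kernel message; the function name and the returned field names are made up for illustration, and the exact message format can differ between kernel versions.

```python
import re

# Assumed kernel message shape, for example:
#   kernel: Killed process 4242 (glusterfs) total-vm:900kB, anon-rss:800kB, file-rss:0kB
OOM_KILL = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(lines, process_name=None):
    """Return one dict per OOM kill, optionally limited to one process name."""
    kills = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match is None:
            continue
        if process_name is not None and match.group("name") != process_name:
            continue
        kills.append({
            "pid": int(match.group("pid")),
            "name": match.group("name"),
            "total_vm_kb": int(match.group("total_vm_kb")),
            "anon_rss_kb": int(match.group("anon_rss_kb")),
        })
    return kills
```

Feeding it the whole time frame of the transfer (for example every line of the system log since the rsync started) avoids missing a kill that happened days before the transfer finally ended.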

Running out of memory on a 128GB RAM system even with a DB occupying
~40% of that is kind of strange though. Might there be a leak?
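
One rough way to test the leak suspicion is to sample the resident memory of the glusterfs client process at intervals and see whether it only ever grows. The sketch below reads the VmRSS field from /proc/<pid>/status on Linux; the helper names and the growth check are assumptions for illustration, and steady growth is only a hint, since caches can also grow for a long time before levelling off.

```python
def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status content, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB".
            return int(line.split()[1])
    return None


def read_rss_kb(pid):
    """Read the current resident set size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_rss_kb(handle.read())


def grows_steadily(samples_kb, min_growth_kb=0):
    """True if each sample exceeds the previous one by more than min_growth_kb."""
    pairs = zip(samples_kb, samples_kb[1:])
    return len(samples_kb) > 1 and all(b - a > min_growth_kb for a, b in pairs)
```

Calling read_rss_kb() from a cron job or a sleep loop and keeping the values in a list would be enough to produce the kind of memory graph linked further down.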

But this would explain the erratic behavior I've experienced over the
last 1.5 years while trying to work with our homes on glusterfs.

Here is the kernel log message for the killed glusterfs process.
https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a

I'm checking the brick and client trace logs. But those are 1TB and 2TB
in size respectively, so searching in them takes a while. I'll be creating
gists for both logs around the time when the process died.
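
Cutting an excerpt out of logs this large can be done in a single streaming pass. The following Python sketch keeps only the lines whose leading timestamp falls within a window around a given moment. It assumes the bracketed timestamp format seen in the brick log earlier in this thread, for example [2018-08-18 22:40:35.502510], and that the log is in chronological order so that reading can stop once the window has passed; the function names are made up for illustration.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_timestamp(line):
    """Return the leading [YYYY-MM-DD HH:MM:SS.ffffff] timestamp, or None."""
    if not line.startswith("["):
        return None
    end = line.find("]")
    if end == -1:
        return None
    try:
        return datetime.strptime(line[1:end], TS_FORMAT)
    except ValueError:
        return None


def lines_around(lines, center, minutes=1):
    """Yield lines stamped within +/- `minutes` of `center`.

    Lines without a timestamp of their own are treated as continuations
    of the preceding stamped line. Reading stops at the first timestamp
    past the window (assumes a chronologically ordered log).
    """
    window = timedelta(minutes=minutes)
    keep = False
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is not None:
            if stamp > center + window:
                return
            keep = abs(stamp - center) <= window
        if keep:
            yield line
```

Because the function takes any iterable of lines, an open file object can be passed in directly and the log is never loaded into memory as a whole.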

As soon as I have more details I'll post them.

Here you can see a graphical representation of the memory usage of this
system: https://imgur.com/a/4BINtfr

Cheers
Richard
--
/dev/null
Richard Neuboeck
2018-09-13 08:07:22 UTC
Permalink
Hi,

I've created excerpts from the brick and client logs covering +/- 1 minute
around the kill event. The excerpts are still ~400-500MB, so I've put them
up for download, since I have no idea what I should be looking for and
skimming them didn't reveal any obvious problems to me.

http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log

I was pointed in the direction of the following bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1613512
It sounds right but seems to have been addressed already.

If there is anything I can do to help solve this problem please let
me know. Thanks for your help!

Cheers
Richard
--
/dev/null
Richard Neuboeck
2018-09-21 07:14:11 UTC
Permalink
Hi again,

In my limited understanding (I'm not a full-time programmer), it's a memory
leak in the gluster FUSE client.

Should I reopen the mentioned bug report or open a new one? Or would the
community prefer an entirely different approach?

Thanks
Richard
Vijay Bellur
2018-09-24 22:39:40 UTC
Permalink
Hello Richard,

Thank you for the logs.

I am wondering if this could be a different memory leak than the one
addressed in the bug. Would it be possible for you to obtain a statedump of
the client so that we can understand the memory allocation pattern better?
Details about gathering a statedump can be found at [1]. Please ensure that
/var/run/gluster is present before triggering a statedump.

Regards,
Vijay

[1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/
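
For reference, a minimal sketch of how such a client statedump might be triggered. It assumes that the FUSE client is the glusterfs process serving the mount, that sending it SIGUSR1 makes it write a dump as described in the statedump documentation [1], and that dumps are written to /var/run/gluster. The function name client_statedump is made up for illustration.

```shell
# Hypothetical helper: ask a glusterfs FUSE client process to write a statedump.
# Assumptions: SIGUSR1 triggers the dump, and dumps go to /var/run/gluster.
client_statedump() {
    pid="$1"
    dump_dir="${2:-/var/run/gluster}"

    # The dump directory must exist, otherwise no dump file can be written.
    if [ ! -d "$dump_dir" ]; then
        echo "dump directory $dump_dir is missing; create it first" >&2
        return 1
    fi

    # Signal the client process; this fails if the PID does not exist.
    kill -USR1 "$pid" || return 1
    echo "statedump requested for PID $pid, check $dump_dir"
}

# Example (not run here): find the client process for the mount, then signal it.
#   pgrep -a glusterfs        # pick the PID of the process serving the mount
#   client_statedump <pid>
```

Checking for the directory first mirrors the note above about /var/run/gluster; without it the signal would be sent but no dump would appear.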
Post by Richard Neuboeck
Hi again,
in my limited - non full time programmer - understanding it's a memory
leak in the gluster fuse client.
Should I reopen the mentioned bug report or open a new one? Or would the
community prefer an entirely different approach?
Thanks
Richard
Post by Richard Neuboeck
Hi,
I've created excerpts from the brick and client logs +/- 1 minute to
the kill event. Still the logs are ~400-500MB so I will put them
somewhere to download since I have no idea what I should be looking
for and skimming them didn't reveal obvious problems to me.
http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
I was pointed in the direction of the following bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1613512
It sounds right but seems to have been addressed already.
If there is anything I can do to help solve this problem please let
me know. Thanks for your help!
Cheers
Richard
Post by Richard Neuboeck
Hi,
since I feared that the logs would fill up the partition (again) I
checked the systems daily and finally found the reason. The glusterfs
process on the client runs out of memory and gets killed by the OOM killer after
about four days. Since rsync runs for a couple of days longer till it
ends I never checked the whole time frame in the system logs and never
stumbled upon the OOM message.
Running out of memory on a 128GB RAM system even with a DB occupying
~40% of that is kind of strange though. Might there be a leak?
But this would explain the erratic behavior I've experienced over the
last 1.5 years while trying to work with our homes on glusterfs.
Here is the kernel log message for the killed glusterfs process.
https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
I'm checking the brick and client trace logs. But those are respectively
1TB and 2TB in size so searching in them takes a while. I'll be creating
gists for both logs about the time when the process died.
As soon as I have more details I'll post them.
Here you can see a graphical representation of the memory usage of this
system: https://imgur.com/a/4BINtfr
Cheers
Richard
Richard Neuboeck
2018-10-15 08:48:27 UTC
Permalink
Hi Vijay,

sorry it took so long. I've upgraded the gluster server and client to
the latest packages 3.12.14-1.el7.x86_64 available in CentOS.

Incredibly, my first test after the update worked perfectly! I'll do
another couple of rsyncs, maybe apply the performance improvements again
and do statedumps all the way.

I'll report back if there are any more problems or if they are resolved.

Thanks for the help so far!
Cheers
Richard
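
While repeating the rsync runs, the client's memory use can be logged alongside the statedumps, so that a leak would show up as steady growth well before the OOM killer steps in. A rough sketch, assuming a Linux /proc filesystem. The names rss_kb and watch_rss are illustrative, and the 600-second default interval is arbitrary.

```shell
# Print the resident set size (in kB) of a process, read from /proc.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Log a timestamped RSS sample every <interval> seconds until the process exits.
watch_rss() {
    pid="$1"
    interval="${2:-600}"
    while [ -r "/proc/$pid/status" ]; do
        printf '%s %s kB\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(rss_kb "$pid")"
        sleep "$interval"
    done
}

# Example (not run here):
#   watch_rss <pid-of-glusterfs-client> 600 >> /var/tmp/glusterfs-rss.log
```

The loop ends on its own once the process disappears, so the last line of the log gives the approximate time and size at which the client went away.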
Richard Neuboeck
2018-11-21 09:22:17 UTC
Permalink
Hi Vijay,

this is an update on the 8 tests I've run so far. In short, all is well.

I followed your advice and created state dumps every 3 hours. The first 4
tests ran with the default volume options. The last 4 tests ran with all
performance optimizations I could find to increase small-file performance.

During each run the dump file size ranged from ~100KB at the beginning of
the mount to ~1GB, reflecting the memory footprint of the gluster process.

Since every test ran without interruption the memory leak seems to be
fixed in 3.12.14-1.el7.x86_64 on CentOS 7.

Thanks again for your help.
Cheers
Richard
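
The three-hourly statedumps described above could be scripted roughly as follows. This is only a sketch under assumptions: SIGUSR1 triggers a client statedump, dumps land in /var/run/gluster, and client dump files are named glusterdump.<pid>.dump.<timestamp>. The function names periodic_statedump and dump_sizes are hypothetical.

```shell
# Request a statedump every 3 hours (10800 s) while the client process is alive.
periodic_statedump() {
    pid="$1"
    while kill -0 "$pid" 2>/dev/null; do
        kill -USR1 "$pid"
        sleep 10800
    done
}

# List statedump files in a directory with their sizes in bytes, in glob order.
# Assumption: client dumps are named glusterdump.<pid>.dump.<timestamp>.
dump_sizes() {
    dump_dir="${1:-/var/run/gluster}"
    for f in "$dump_dir"/glusterdump.*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -f "$f" ] || continue
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}

# Example (not run here):
#   periodic_statedump <pid-of-glusterfs-client> &
#   dump_sizes /var/run/gluster
```

Comparing the sizes from one dump to the next gives a coarse view of how the client's footprint develops over a run.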
Post by Richard Neuboeck
Hi Vijay,
sorry it took so long. I've upgraded the gluster server and client to
the latest packages 3.12.14-1.el7.x86_64 available in CentOS.
Incredibly my first test after the update worked perfectly! I'll do
another couple of rsyncs, maybe apply the performance improvements again
and do statedumps all the way.
I'll report back if there are any more problems or if they are resolved.
Thanks for the help so far!
Cheers
Richard
Post by Vijay Bellur
Hello Richard,
Thank you for the logs.
I am wondering if this could be a different memory leak than the one
addressed in the bug. Would it be possible for you to obtain a
statedump of the client so that we can understand the memory allocation
pattern better? Details about gathering a statedump can be found at [1].
Please ensure that /var/run/gluster is present before triggering a
statedump.
Regards,
Vijay
[1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/
Hi again,
in my limited - non full time programmer - understanding it's a memory
leak in the gluster fuse client.
Should I reopen the mentioned bugreport or open a new one? Or would the
community prefer an entirely different approach?
Thanks
Richard
Post by Richard Neuboeck
Hi,
I've created excerpts from the brick and client logs +/- 1 minute to
the kill event. Still the logs are ~400-500MB so will put them
somewhere to download since I have no idea what I should be looking
for and skimming them didn't reveal obvious problems to me.
http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
<http://www.tbi.univie.ac.at/%7Ehawk/gluster/brick_3min_excerpt.log>
Post by Richard Neuboeck
http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
<http://www.tbi.univie.ac.at/%7Ehawk/gluster/mnt_3min_excerpt.log>
Post by Richard Neuboeck
I was pointed in the direction of the following Bugreport
https://bugzilla.redhat.com/show_bug.cgi?id=1613512
It sounds right but seems to have been addressed already.
If there is anything I can do to help solve this problem please let
me know. Thanks for your help!
Cheers
Richard
Post by Richard Neuboeck
Hi,
since I feared that the logs would fill up the partition (again) I
checked the systems daily and finally found the reason. The glusterfs
process on the client runs out of memory and get's killed by OOM
after
Post by Richard Neuboeck
Post by Richard Neuboeck
about four days. Since rsync runs for a couple of days longer till it
ends I never checked the whole time frame in the system logs and
never
Post by Richard Neuboeck
Post by Richard Neuboeck
stumbled upon the OOM message.
Running out of memory on a 128GB RAM system even with a DB occupying
~40% of that is kind of strange though. Might there be a leak?
But this would explain the erratic behavior I've experienced over the
last 1.5 years while trying to work with our homes on glusterfs.
Here is the kernel log message for the killed glusterfs process.
https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
I'm checking the brick and client trace logs. But those are
respectively
Post by Richard Neuboeck
Post by Richard Neuboeck
1TB and 2TB in size so searching in them takes a while. I'll be
creating
Post by Richard Neuboeck
Post by Richard Neuboeck
gists for both logs about the time when the process died.
As soon as I have more details I'll post them.
Here you can see a graphical representation of the memory usage
of this
Post by Richard Neuboeck
Post by Richard Neuboeck
system: https://imgur.com/a/4BINtfr
Cheers
Richard
On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
     > +Mohit. +Milind
     >
     >
     > Can you check logs and see whether you can find anything relevant?
     From glances at the system logs nothing out of the ordinary
     occurred. However I'll start another rsync and take a closer look.
     It will take a few days.
     >
     > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
     >
     >     Hi,
     >
     >     I'm attaching a shortened version since the whole is about 5.8GB of
     >     the client mount log. It includes the initial mount messages and the
     >     last two minutes of log entries.
     >
     >     It ends very anticlimactic without an obvious error. Is there
     >     anything specific I should be looking for?
     >
     >
     > Normally I look logs around disconnect msgs to find out the reason.
     > But as you said, sometimes one can see just disconnect msgs without
     > any reason. That normally points to reason for disconnect in the
     > network rather than a Glusterfs initiated disconnect.
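
Looking at the lines around disconnect messages, as described above, is
tedious by hand when the logs run to gigabytes. The following is a
minimal sketch of one way to do it in a single streaming pass that keeps
only a few preceding lines in memory. The function name, the
"disconnect" search string and the context size are assumptions made for
illustration; this is not part of any gluster tooling.

```python
from collections import deque
from typing import Iterable, Iterator, List


def lines_before_match(
    lines: Iterable[str], needle: str = "disconnect", context: int = 3
) -> Iterator[List[str]]:
    """Yield each matching line together with the `context` lines before it.

    Works on any iterable of lines, including an open file object, so a
    very large log is never loaded into memory as a whole.
    """
    recent = deque(maxlen=context)  # rolling window of preceding lines
    for line in lines:
        if needle in line:
            yield list(recent) + [line]
        recent.append(line)
```

Passing an open log file (opened with errors="replace" to survive odd
bytes) as the first argument prints nothing by itself; each yielded list
can be joined and printed or written to a smaller excerpt file.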
     The rsync source is serving our homes currently so there are NFS
     connections 24/7. There don't seem to be any network related
     interruptions
Can you set diagnostics.client-log-level and diagnostics.brick-log-level
to TRACE and check logs of both ends of connections - client and brick?
To reduce the logsize, I would suggest to logrotate existing logs and
start with fresh logs when you are about to start so that only relevant
logs are captured. Also, can you take an strace of the client and brick
processes:
strace -o <outputfile> -ff -v -p <pid>
and attach both logs and strace output. Let's trace through what syscalls
on socket return and then decide whether to inspect tcpdump or not. If you
don't want to repeat tests again, please capture tcpdump too (on both ends
of connection) and send them to us.
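
The log-level and strace requests above could be assembled roughly as
follows. This is a sketch under stated assumptions: the volume name
"home" is taken from the volume info quoted in this thread, the PID and
the output path are placeholders, and the script only prints the
commands for review instead of running them. tcpdump is left out because
the capture interface and ports are not given here.

```python
import shlex
from typing import List


def trace_setup_commands(volume: str, pid: int, strace_out: str) -> List[List[str]]:
    """Build (but do not run) the commands for TRACE logging and strace."""
    return [
        ["gluster", "volume", "set", volume, "diagnostics.client-log-level", "TRACE"],
        ["gluster", "volume", "set", volume, "diagnostics.brick-log-level", "TRACE"],
        ["strace", "-o", strace_out, "-ff", "-v", "-p", str(pid)],
    ]


def render(commands: List[List[str]]) -> List[str]:
    """Turn argument lists into shell-safe command lines for copy and paste."""
    return [" ".join(shlex.quote(part) for part in cmd) for cmd in commands]


# Placeholder PID and output path; replace with the real client or brick PID.
for line in render(trace_setup_commands("home", 12345, "/tmp/glusterfs.strace")):
    print(line)
```

Keeping the commands as argument lists means they could later be handed
to subprocess.run unchanged once they have been checked by eye.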
     - a co-worker would be here faster than I could check
     the logs if the connection to home would be broken ;-)
     The three gluster machines are due to this problem reduced to only
     testing so there is nothing else running.
     >
     >     Cheers
     >     Richard
     >
     >     > Normally client logs will give a clue on why the disconnections are
     >     > happening (ping-timeout, wrong port etc). Can you look into client
     >     > logs to figure out what's happening? If you can't find anything, can
     >     > you send across client logs?
     >     >
     >     > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
     >     >
     >     >     Hi Gluster Community,
     >     >
     >     >     I have problems with a glusterfs 'Transport endpoint not connected'
     >     >     connection abort during file transfers that I can replicate (all the
     >     >     time now) but not pinpoint as to why this is happening.
     >     >
     >     >     The volume is set up in replica 3 mode and accessed with the fuse
     >     >     gluster client. Both client and server are running CentOS and the
     >     >     supplied 3.12.11 version of gluster.
     >     >
     >     >     The connection abort happens at different times during rsync but
     >     >     occurs every time I try to sync all our files (1.1TB) to the empty
     >     >     volume.
     >     >
     >     >     Client and server side I don't find errors in the gluster log files.
     >     >     rsync logs the obvious transfer problem. The only log that shows
     >     >     anything related is the server brick log which states that the
     >     >     connection is shutting down:
     >     >
     >     >     [2018-08-18 22:40:35.502510] I [MSGID: 115036]
     >     >     [server.c:527:server_rpc_notify] 0-home-server: disconnecting
     >     >     connection from brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
     >     >     [2018-08-18 22:40:35.502620] W
     >     >     [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock
     >     >     on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
     >     >     {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}
     >     >     [2018-08-18 22:40:35.502692] W
     >     >     [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
     >     >     on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
     >     >     {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
     >     >     [2018-08-18 22:40:35.502719] W
     >     >     [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
     >     >     on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
     >     >     {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
     >     >     [2018-08-18 22:40:35.505950] I [MSGID: 101055]
     >     >     [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
     >     >     connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
     >     >
     >     >     Since I'm running another replica 3 setup for oVirt for a long time
     >     >     now which is completely stable I thought I made a mistake setting
     >     >     different options at first. However even when I reset those options
     >     >     I'm able to reproduce the connection problem.
     >     >
     >     >
     >     >     Volume Name: home
     >     >     Type: Replicate
     >     >     Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
     >     >     Status: Started
     >     >     Snapshot Count: 0
     >     >     Number of Bricks: 1 x 3 = 3
     >     >     Transport-type: tcp
     >     >     Brick1: sphere-four:/srv/gluster_home/brick
     >     >     Brick2: sphere-five:/srv/gluster_home/brick
     >     >     Brick3: sphere-six:/srv/gluster_home/brick
     >     >     nfs.disable: on
     >     >     transport.address-family: inet
     >     >     cluster.quorum-type: auto
     >     >     cluster.server-quorum-type: server
     >     >     cluster.server-quorum-ratio: 50%
     >     >
     >     >
     >     >
     >     >     performance.cache-size: 5GB
     >     >     client.event-threads: 4
     >     >     server.event-threads: 4
     >     >     cluster.lookup-optimize: on
     >     >     features.cache-invalidation: on
     >     >     performance.stat-prefetch: on
     >     >     performance.cache-invalidation: on
     >     >     network.inode-lru-limit: 50000
     >     >     features.cache-invalidation-timeout: 600
     >     >     performance.md-cache-timeout: 600
     >     >     performance.parallel-readdir: on
     >     >
     >     >
     >     >     In this case the gluster servers and also the client is using a
     >     >     bonded network device running in adaptive load balancing mode.
     >     >
     >     >     I've tried using the debug option for the client mount. But except
     >     >     for a ~0.5TB log file I didn't get information that seems
     >     >     helpful to me.
     >     >
     >     >     Transferring just a couple of GB works without problems.
     >     >
     >     >     It may very well be that I'm already blind to the obvious but after
     >     >     many long running tests I can't find the crux in the setup.
     >     >
     >     >     Does anyone have an idea as how to approach this problem in a way
     >     >     that sheds some useful information?
     >     >
     >     >     Any help is highly appreciated!
     >     >     Cheers
     >     >     Richard
     >     >
     >     >     --
     >     >     /dev/null
     >     >
     >     >
     >     >
     >     >
     >     >     _______________________________________________
     >     >     Gluster-users mailing list
     >     >     https://lists.gluster.org/mailman/listinfo/gluster-users
     >     >
     >     >
     >
     >
     >     --
     >     /dev/null
     >
     >
     --
     /dev/null
_______________________________________________
Gluster-users mailing list
https://lists.gluster.org/mailman/listinfo/gluster-users