Discussion:
[Gluster-users] Client un-mounting since upgrade to 3.12.9-1 version
mohammad kashif
2018-06-11 15:50:13 UTC
Permalink
Hi

Since I updated our gluster servers and clients to the latest version,
3.12.9-1, the gluster volume has been getting unmounted from clients
very regularly. It was not a problem before the update.

It's a distributed file system with no replication. We have seven servers
totaling around 480TB of data, and it is 97% full.

I am using the following config on the server:


gluster volume set atlasglust features.cache-invalidation on
gluster volume set atlasglust features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust performance.cache-invalidation on
gluster volume set atlasglust performance.md-cache-timeout 600
gluster volume set atlasglust performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4

Clients are mounted with these options:

defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev


I can't see anything in the log file. Can someone suggest how to
troubleshoot this issue?
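For reference, the full fstab entry on the clients looks roughly like this (the server name and mount point below are placeholders, not the real ones):

```shell
# /etc/fstab entry (hostname and paths are placeholders)
gluster-server01:/atlasglust  /mnt/atlasglust  glusterfs  defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev  0 0
```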

Thanks

Kashif
Vijay Bellur
2018-06-11 22:52:18 UTC
Permalink
Post by mohammad kashif
[...]
Can you please share the log file? Checking for messages related to
disconnections/crashes in the log file would be a good way to start
troubleshooting the problem.

Thanks,
Vijay
mohammad kashif
2018-06-12 10:07:19 UTC
Permalink
Hi Vijay

Now it is unmounting every 30 minutes!

The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
has only these lines:

[2018-06-12 09:53:19.303102] I [MSGID: 115013]
[server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on
/atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down
connection <server-name>
-2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0

There is no other information. Is there any way to increase log verbosity?

on the client

[2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-5:
Connected to atlasglust-client-5, attached to remote volume
'/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-5:
Server and Client lk-version numbers are not same, reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-5:
Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-6:
Connected to atlasglust-client-6, attached to remote volume
'/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-6:
Server and Client lk-version numbers are not same, reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-6:
Server lk version = 1
[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse:
switched to graph 0


Is there a problem with the server and client lk-version?

Thanks for your help.

Kashif
Post by Vijay Bellur
[...]
Milind Changire
2018-06-12 11:26:02 UTC
Permalink
Kashif,
You can change the log level by:
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE

and see how things fare.

If you want fewer logs you can change the log-level to DEBUG instead of
TRACE.
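Once you have captured enough information, you can return to the default log levels with something like:

```shell
# restore the default (INFO) log levels after capturing traces
gluster volume set <vol> diagnostics.brick-log-level INFO
gluster volume set <vol> diagnostics.client-log-level INFO
```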
Post by mohammad kashif
[...]
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Milind
mohammad kashif
2018-06-12 12:29:08 UTC
Permalink
Hi Milind, Vijay

Thanks. I have some more information now, as I straced glusterd on a client:

138544 0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) =
0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) =
0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) =
0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR,
si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL,
si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++

As far as I understand, gluster is somehow trying to access memory in an
inappropriate manner and the kernel sends SIGSEGV.

I also got the core dump. I am trying gdb for the first time, so I am not
sure whether I am using it correctly:

gdb /usr/sbin/glusterfs core.138536

It just tells me that the program terminated with signal 11, segmentation fault.

The problem is not limited to one client but is happening on many clients.

I would really appreciate any help, as the whole file system has become unusable.

Thanks

Kashif
Post by Milind Changire
[...]
Milind Changire
2018-06-12 14:16:56 UTC
Permalink
Kashif,
Could you share the core dump via Google Drive or something similar?

Also, let me know the CPU arch and OS Distribution on which you are running
gluster.

If you've installed the glusterfs-debuginfo package, you'll also get the
source lines in the backtrace via gdb.
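For example, something along these lines should get you backtraces (the core file name is the one from your earlier mail):

```shell
gdb /usr/sbin/glusterfs core.138536
# then, at the (gdb) prompt:
#   bt                    -- backtrace of the crashing thread
#   thread apply all bt   -- backtraces of all threads
```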
Post by mohammad kashif
[...]
--
Milind
mohammad kashif
2018-06-12 14:40:01 UTC
Permalink
Hi Milind

The operating system is Scientific Linux 6, which is based on RHEL 6. The
CPU arch is Intel x86_64.

I will send you a separate email with a link to the core dump.

Thanks for your help.

Kashif
Post by Milind Changire
[...]
Vijay Bellur
2018-06-12 14:49:16 UTC
Permalink
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on RHEL6. The
cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for 'crash' in the client log file; the lines
following a crash would have a backtrace in most cases.
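Something like this (log file name depends on your mount point):

```shell
# print each crash marker plus the backtrace lines that follow it
grep -A 30 "crash" /var/log/glusterfs/<mount-point>.log
```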

HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you are
running gluster.
If you've installed the glusterfs-debuginfo package, you'll also get the
source lines in the backtrace via gdb
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced glusterd on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR,
si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL,
si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As for I understand that somehow gluster is trying to access memory in
appropriate manner and kernel sends SIGSEGV
I also got the core dump. I am trying gdb first time so I am not sure
whether I am using it correctly
gdb /usr/sbin/glusterfs core.138536
It just tell me that program terminated with signal 11, segmentation fault .
The problem is not limited to one client but happening to many clients.
I will really appreciate any help as whole file system has become unusable
Thanks
Kashif
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to DEBUG instead of
TRACE.
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins !
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
have this line only
2018-06-12 09:53:19.303102] I [MSGID: 115013]
[server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup
on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down
connection <server-name> -2224879-2018/06/12-09:51:01:4
60889-atlasglust-client-0-0-0
There is no other information. Is there any way to increase log verbosity?
on the client
2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync]
0-fuse: switched to graph 0
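All of the client log lines above are level I (informational). In glusterfs logs the single letter after the bracketed timestamp is the severity, so real problems show up as W, E, or C lines. A small sketch of pulling that field out of one of the sample lines quoted above:

```shell
# One of the client log lines quoted above; the severity letter sits between
# the bracketed timestamp and the source-location bracket.
log='[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited'
# Strip everything except the single severity letter after the timestamp.
level=$(echo "$log" | sed -E 's/^\[[^]]+\] ([A-Z]) .*/\1/')
echo "$level"   # prints: I
```

In practice, `grep -E '\] (W|E|C) \[' /var/log/glusterfs/<mountpoint>.log` is a quick way to skip the informational noise when hunting for the disconnect or crash.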
Is there a problem with the server and client lk-version?
Thanks for your help.
Kashif
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Milind
Milind Changire
2018-06-12 15:14:48 UTC
Permalink
Kashif,
Could you also send over the client/mount log file as Vijay suggested?
Or at least the lines around the crash backtrace.

Also, you've mentioned that you straced glusterd, but when you ran gdb, you
ran it over /usr/sbin/glusterfs.
Post by Vijay Bellur
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on RHEL6. The
cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the lines
following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you are
running gluster.
If you've installed the glusterfs-debuginfo package, you'll also get the
source lines in the backtrace via gdb
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced glusterd on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL,
si_addr=0} ---
--
Milind
mohammad kashif
2018-06-12 16:01:36 UTC
Permalink
Hi Milind

I will send you links to the logs.

I collected these core dumps on the client, and there is no glusterd process
running on the client.

Kashif
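Milind's glusterd vs. glusterfs point matters here: glusterd is the management daemon that runs on the servers, while a FUSE mount runs a /usr/sbin/glusterfs client process. A quick sketch of checking which gluster processes actually exist on a client (exact command lines will differ per mount):

```shell
# The bracket trick keeps grep from matching its own command line.
# On a client you should see /usr/sbin/glusterfs with --volfile-server=...
# and the mount point; glusterd should only appear on the servers.
ps -ef | grep '[g]luster'
```

This only produces output on a machine with gluster processes running, so run it on the affected client.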
mohammad kashif
2018-06-13 09:51:44 UTC
Permalink
Hi Milind

There is no glusterfs-debuginfo available for gluster-3.12 in the
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo. Do you
know where I can get it?

Also, when I run gdb, it says:

Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64

I can't find a debug package for glusterfs-fuse either.

Thanks from the pit of despair ;)
Kashif
Milind Changire
2018-06-13 10:34:15 UTC
Permalink
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
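Since the SIG mirror itself doesn't carry the debuginfo packages, one way to use that tree is a small yum repo file pointing at it. A sketch, assuming EL6/x86_64; the repo id and filename here are made up:

```shell
# Hypothetical repo file for the CentOS debuginfo tree linked above.
cat > /etc/yum.repos.d/storage-sig-debuginfo.repo <<'EOF'
[storage-sig-debuginfo]
name=CentOS-6 Storage SIG - debuginfo
baseurl=http://debuginfo.centos.org/centos/6/storage/x86_64/
enabled=0
gpgcheck=0
EOF

# Enable the repo only for this install; afterwards gdb can resolve
# source lines in the backtrace.
yum --enablerepo=storage-sig-debuginfo install glusterfs-debuginfo
```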
--
Milind
mohammad kashif
2018-06-13 10:59:11 UTC
Permalink
Hi Milind

Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here

http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log


I am trying to understand it but am still not able to make sense of it.

Thanks

Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo. Do
you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find a debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
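[Editor's note: given the debuginfo URL Milind shares above, a minimal yum repo definition could look like the sketch below; the repo id, name, and file name are made up, so verify the baseurl against your distribution before using it.]

```
# /etc/yum.repos.d/centos-storage-debuginfo.repo  (hypothetical file name)
[centos-storage-debuginfo]
name=CentOS Storage SIG - debuginfo
baseurl=http://debuginfo.centos.org/centos/6/storage/x86_64/
enabled=1
gpgcheck=0
```

With such a repo enabled, `debuginfo-install glusterfs-fuse-3.12.9-1.el6.x86_64` (or a plain `yum install glusterfs-debuginfo`) should be able to pull in the symbols gdb asks for.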
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps at client and there is no glusterd process
running on client.
Kashif
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you ran gdb,
you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on RHEL6.
The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the lines
following crash would have a backtrace in most cases.
HTH,
Vijay
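[Editor's note: Vijay's grep suggestion can be sketched concretely. The sample log created here is fabricated for illustration, but glusterfs client crash reports do begin with a `pending frames:` line followed by `signal received: <n>`.]

```shell
# Create a tiny sample client log (hypothetical content, for illustration).
cat > /tmp/sample-client.log <<'EOF'
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0
pending frames:
frame : type(1) op(READDIRP)
signal received: 11
time of crash:
EOF
# A crash report begins with "pending frames:"; print it with some context.
grep -nA 4 'pending frames:' /tmp/sample-client.log
```

On a real client the log lives under /var/log/glusterfs/ and is named after the mount point.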
Post by mohammad kashif
Thanks for your help.
Kashif
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you are
running gluster.
If you've installed the glusterfs-debuginfo package, you'll also get
the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced glusterd on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As far as I understand, gluster is somehow trying to access memory
in an inappropriate manner and the kernel sends SIGSEGV
I also got the core dump. I am using gdb for the first time so I am not
sure whether I am using it correctly:
gdb /usr/sbin/glusterfs core.138536
It just tells me that the program terminated with signal 11,
segmentation fault.
The problem is not limited to one client but is happening on many clients.
I would really appreciate any help as the whole file system has become unusable.
Thanks
Kashif
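[Editor's note: once glusterfs-debuginfo is installed, a full backtrace can be pulled from the core non-interactively; this is a generic gdb command-file sketch, not anything gluster-specific.]

```
# backtrace.gdb -- run as:  gdb -batch -x backtrace.gdb /usr/sbin/glusterfs core.138536
set pagination off
thread apply all bt full
```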
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to DEBUG
instead of TRACE.
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins!
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
have this line only
[2018-06-12 09:53:19.303102] I [MSGID: 115013]
[server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd
cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting
down connection <server-name> -2224879-2018/06/12-09:51:01:4
60889-atlasglust-client-0-0-0
There is no other information. Is there any way to increase log
verbosity?
on the client
[2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync]
0-fuse: switched to graph 0
Is there a problem with the server and client lk-version?
Thanks for your help.
Kashif
Milind Changire
2018-06-13 11:10:47 UTC
Permalink
+Nithya

Nithya,
Do these logs [1] look similar to the recursive readdir() issue that you
encountered just a while back?
i.e. the recursive readdir() response definition in the XDR

[1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
Nithya Balachandran
2018-06-14 04:37:50 UTC
Permalink
This is not the same issue as the one you are referring to - that was in the
RPC layer and caused the bricks to crash. This one is different as it seems
to be in the dht and rda layers. It does look like a stack overflow though.
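[Editor's note: a quick way to check the stack-overflow theory against the posted backtrace.log is to count the frame lines and look for the same two functions alternating. The sample below is a fabricated four-frame file with illustrative function names; a real overflow would show tens of thousands of frames.]

```shell
# Fabricated backtrace sample (function names illustrative only).
cat > /tmp/bt-sample.log <<'EOF'
#0  0x00007f2f in rda_fill_fd_cbk () from readdir-ahead.so
#1  0x00007f2f in dht_readdirp_cbk () from dht.so
#2  0x00007f2f in rda_fill_fd_cbk () from readdir-ahead.so
#3  0x00007f2f in dht_readdirp_cbk () from dht.so
EOF
# Count stack frames: an enormous number suggests runaway recursion.
grep -c '^#[0-9]' /tmp/bt-sample.log
```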

@Mohammad,

Please send the following information:

1. gluster volume info
2. The number of entries in the directory being listed
3. System memory

Does this still happen if you turn off parallel-readdir?

Regards,
Nithya
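[Editor's note: for anyone following along, the three items Nithya asks for can be gathered as sketched below. The gluster commands are shown as comments since they need the CLI on a gluster node; the entry count uses `ls -f` to avoid sorting a huge directory, demonstrated here on a made-up throwaway directory.]

```shell
# 1. Volume configuration (run on a gluster node):
#      gluster volume info atlasglust
# 2. Number of entries in the directory being listed -- demo with a throwaway dir:
mkdir -p /tmp/demo_dir
touch /tmp/demo_dir/f1 /tmp/demo_dir/f2 /tmp/demo_dir/f3
# 'ls -f' skips the stat+sort pass; filter out '.' and '..' before counting.
ls -f /tmp/demo_dir | grep -vc '^\.\.\?$'
# 3. System memory:
#      free -m
# And to test Nithya's suggestion (run on a gluster node):
#      gluster volume set atlasglust performance.parallel-readdir off
```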
Nithya Balachandran
2018-06-14 04:39:39 UTC
Permalink
+Poornima who works on parallel-readdir.

@Poornima, Have you seen anything like this before?
Post by Nithya Balachandran
This is not the same issue as the one you are referring - that was in the
RPC layer and caused the bricks to crash. This one is different as it seems
to be in the dht and rda layers. It does look like a stack overflow though.
@Mohammad,
1. gluster volume info
2. The number of entries in the directory being listed
3. System memory
Does this still happen if you turn off parallel-readdir?
Regards,
Nithya
Post by Milind Changire
+Nithya
Nithya,
Do these logs [1] look similar to the recursive readdir() issue that you
encountered just a while back ?
i.e. recursive readdir() response definition in the XDR
[1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
Post by mohammad kashif
Hi Milind
Thanks a lot, I manage to run gdb and produced traceback as well. Its here
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand but still not able to make sense out of it.
Thanks
Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo.
Do you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps at client and there is no glusterd
process running on client.
Kashif
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you ran
gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on
RHEL6. The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the lines
following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you
are running gluster.
If you've installed the glusterfs-debuginfo package, you'll also
get the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced glusterd
on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As for I understand that somehow gluster is trying to access
memory in appropriate manner and kernel sends SIGSEGV
I also got the core dump. I am trying gdb first time so I am not
sure whether I am using it correctly
gdb /usr/sbin/glusterfs core.138536
It just tell me that program terminated with signal 11,
segmentation fault .
The problem is not limited to one client but happening to many clients.
I will really appreciate any help as whole file system has
become unusable
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to DEBUG
instead of TRACE.
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins !
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
have this line only
2018-06-12 09:53:19.303102] I [MSGID: 115013]
[server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd
cleanup on /atlas/atlasdata/zgubic/hmumu/
histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
Shutting down connection <server-name> -2224879-2018/06/12-09:51:01:4
60889-atlasglust-client-0-0-0
There is no other information. Is there any way to increase
log verbosity?
on the client
2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync]
0-fuse: switched to graph 0
is there a problem with server and client 1k version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
Can you please share the log file? Checking for messages
related to disconnections/crashes in the log file would be a good way to
start troubleshooting the problem.
Thanks,
Vijay
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Milind
mohammad kashif
2018-06-14 11:12:47 UTC
Permalink
Hi Nithya

It seems that the problem can be solved either by turning parallel-readdir off or by downgrading the client to 3.10.12-1. Yesterday I downgraded some clients to 3.10.12-1 and that seems to have fixed the problem. Today, after seeing your email, I disabled parallel-readdir and the current 3.12.9-1 client started to work. I upgraded the servers and clients to 3.12.9-1 last month, and since then clients had been unmounting intermittently about once a week. But during the last three days they started unmounting every few minutes. I don't know what triggered this sudden panic, except that the file system was quite full: around 98%. It is a 480 TB file system with almost 80 million files.

Servers have 64GB RAM and clients have 64GB to 192GB RAM. I tested with a 192GB RAM client and it still had the same issue.
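For anyone hitting the same crash, the workaround described above boils down to two steps; a minimal sketch, assuming the volume name from this thread and a hypothetical mount point:

```shell
# 1. On any gluster server: disable parallel-readdir for the volume.
gluster volume set atlasglust performance.parallel-readdir off

# 2. On each affected client: remount so the client loads the new graph
#    (the mount point /glusteratlas is an assumption).
umount /glusteratlas
mount -t glusterfs pplxgluster01.X.Y.Z:/atlasglust /glusteratlas
```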


Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 7
Transport-type: tcp
Bricks:
Brick1: pplxgluster01.X.Y.Z:/glusteratlas/brick001/gv0
Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
Options Reconfigured:
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
performance.cache-invalidation: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.client-io-threads: on
performance.cache-size: 1GB
performance.parallel-readdir: off
performance.md-cache-timeout: 600
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
auth.allow: X.Y.Z.*
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on


Thanks

Kashif
Post by Nithya Balachandran
+Poornima who works on parallel-readdir.
@Poornima, Have you seen anything like this before?
Post by Nithya Balachandran
This is not the same issue as the one you are referring to - that was in the
RPC layer and caused the bricks to crash. This one is different, as it seems
to be in the dht and rda layers. It does look like a stack overflow though.
@Mohammad, please send us the following:
1. gluster volume info
2. The number of entries in the directory being listed
3. System memory
Does this still happen if you turn off parallel-readdir?
Regards,
Nithya
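For item 2 above (the number of entries in the directory being listed), a plain count is enough; sketched here on a scratch directory (the path /tmp/rd-test is just an example, not one from the thread):

```shell
# Create a scratch directory with three entries, then count its
# immediate children (files and subdirectories alike).
mkdir -p /tmp/rd-test && touch /tmp/rd-test/a /tmp/rd-test/b /tmp/rd-test/c
find /tmp/rd-test -mindepth 1 -maxdepth 1 | wc -l
```

Run the same `find ... | wc -l` against the directory whose listing triggers the crash.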
Post by Milind Changire
+Nithya
Nithya,
Do these logs [1] look similar to the recursive readdir() issue that
you encountered just a while back ?
i.e. recursive readdir() response definition in the XDR
[1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
Post by mohammad kashif
Hi Milind
Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here:
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand it but am still not able to make sense of it.
Thanks
Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo.
Do you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find debug package for glusterfs-fuse either
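Milind's debuginfo.centos.org link below hosts the debuginfo RPMs outside the regular repos; one way to install directly from there (the exact package file name is an assumption - browse the directory listing for the matching 3.12.9-1 build):

```shell
# Install the debuginfo RPM straight from the CentOS debuginfo server
# (file name below is a guess; verify it against the directory listing).
yum install http://debuginfo.centos.org/centos/6/storage/x86_64/glusterfs-debuginfo-3.12.9-1.el6.x86_64.rpm
```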
Thanks from the pit of despair ;)
Kashif
On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps on the client, and there is no glusterd
process running on the client.
Kashif
On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you ran
gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on
RHEL6. The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the lines
following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you
are running gluster.
If you've installed the glusterfs-debuginfo package, you'll also
get the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now, as I straced glusterd
on the client:
138544 0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As far as I understand, gluster is somehow trying to access
memory in an inappropriate manner and the kernel sends SIGSEGV.
I also got the core dump. I am trying gdb for the first time, so I am
not sure whether I am using it correctly:
gdb /usr/sbin/glusterfs core.138536
It just tells me that the program terminated with signal 11,
segmentation fault.
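A minimal non-interactive gdb invocation that dumps a full backtrace from the core to a file (binary and core file names as used above; with glusterfs-debuginfo installed, the output will include source lines):

```shell
# Run gdb in batch mode: disable paging, print a full backtrace for
# every thread, then quit, capturing everything to backtrace.log.
gdb /usr/sbin/glusterfs core.138536 \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    -ex 'quit' > backtrace.log 2>&1
```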
The problem is not limited to one client but is happening on many
clients.
I would really appreciate any help, as the whole file system has
become unusable.
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to DEBUG
instead of TRACE.
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 minutes!
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
has only these lines:
[2018-06-12 09:53:19.303102] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down connection <server-name> -2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
Jim Kinney
2018-06-14 14:30:51 UTC
Permalink
Hmm. I have several 3.12.9 volumes with 3.12.9 clients that are
dropping the mount, yet parallel-readdir is off. This is only happening
on the RDMA interface; the TCP transport mounts are fine.
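One way to confirm the transport dependence is to mount the same volume twice, once per transport, and watch which mount drops (server, volume, and mount-point names below are placeholders, not from this thread):

```shell
# Mount the volume over RDMA and over TCP side by side.
mount -t glusterfs -o transport=rdma server1:/myvol /mnt/myvol-rdma
mount -t glusterfs -o transport=tcp  server1:/myvol /mnt/myvol-tcp
```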
Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize off
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon enable
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 1
performance.cache-priority
performance.cache-size 32MB
performance.io-thread-count 16
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.cache-size 128MB
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.lazy-open yes
performance.read-after-open no
performance.read-ahead-page-count 4
performance.md-cache-timeout 1
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
features.lock-heal off
features.grace-timeout 10
network.remote-dio disable
client.event-threads 2
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 16384
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure (null)
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
features.lock-heal off
features.grace-timeout 10
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 1
server.tcp-user-timeout 0
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 10
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads on
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache off
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation false
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.limit-usage (null)
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.disable off
features.read-only off
features.worm off
features.worm-file-level off
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.bd-aio off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir (null)
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-promote-file-size 0
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
cluster.tier-query-limit 100
cluster.tier-compact on
cluster.tier-hot-compact-frequency 604800
cluster.tier-cold-compact-frequency 604800
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 12500
features.ctr-sql-db-wal-autocheckpoint 25000
features.selinux on
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation off
features.cache-invalidation-timeout 60
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir off
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex off
cluster.max-bricks-per-process 0
disperse.optimistic-change-log on
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
Post by mohammad kashif
Hi Nithya
It seems that problem can be solved by either turning parallel-readir
off or downgrading client to 3.10.12-1 . Yesterday I downgraded some
clients to 3.10.12-1 and it seems to fixed the problem. Today when I
saw your email then I disabled parallel-readir off and the current
client 3.12.9-1 started to work. I upgraded server and clients to
3.12.9-1 last month and since then clients were intermittently
unmounting once in a week. But during last three days, it started
unmounting every few minutes. I don't know that what triggered this
sudden panic except that file system was quite full; around 98%. It
is 480 TB file system. The file system has almost 80 Million files.
Servers have 64GB RAM and clients have 64GB to 192GB RAM. I tested
with 192GB RAM client and it still had the same issue.
Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 7
Transport-type: tcp
Brick1: pplxgluster01.X.Y.Z/glusteratlas/brick001/gv0
Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
performance.cache-invalidation: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.client-io-threads: on
performance.cache-size: 1GB
performance.parallel-readdir: off
performance.md-cache-timeout: 600
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
auth.allow: X.Y.Z.*
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
Thanks
Kashif
Post by Nithya Balachandran
+Poornima who works on parallel-readdir.
@Poornima, Have you seen anything like this before?
Post by Nithya Balachandran
This is not the same issue as the one you are referring - that
was in the RPC layer and caused the bricks to crash. This one is
different as it seems to be in the dht and rda layers. It does
look like a stack overflow though.
@Mohammad,
1. gluster volume info
2. The number of entries in the directory being listed
3. System memory
Does this still happen if you turn off parallel-readdir?
Regards,
Nithya
Post by Milind Changire
+Nithya
Nithya,
Do these logs [1] look similar to the recursive readdir()
issue that you encountered just a while back ?
i.e. recursive readdir() response definition in the XDR
[1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
Post by mohammad kashif
Hi Milind
Thanks a lot, I manage to run gdb and produced traceback as well. Its here
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand but still not able to make sense out of it.
Thanks
Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <kashif.al
Post by mohammad kashif
Hi Milind
There is no
glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-
3.12/ repo. Do
you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <kashif.
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps at client and there is no
glusterd process running on client.
Kashif
On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <mchan
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as
Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but
when you ran gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbellu
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <k
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which
is based on RHEL6. The cpu arch is Intel x86_64.
I will send you a separate email with link to
core dump.
You could also grep for crash in the client log
file and the lines following crash would have a
backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive
or something similar
Also, let me know the CPU arch and OS
Distribution on which you are running gluster.
If you've installed the glusterfs-debuginfo
package, you'll also get the source lines in
the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I
straced glusterd on client
138544 0.000131 mprotect(0x7f2f70785000,
4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000,
4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000,
4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV
{si_signo=SIGSEGV, si_code=SEGV_ACCERR,
si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV
{si_signo=SIGSEGV, si_code=SI_KERNEL,
si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV
(core dumped) +++
138550 0.000041 +++ killed by SIGSEGV
(core dumped) +++
138547 0.000008 +++ killed by SIGSEGV
(core dumped) +++
138546 0.000007 +++ killed by SIGSEGV
(core dumped) +++
138545 0.000007 +++ killed by SIGSEGV
(core dumped) +++
138544 0.000008 +++ killed by SIGSEGV
(core dumped) +++
138543 0.000007 +++ killed by SIGSEGV
(core dumped) +++
As for I understand that somehow gluster is
trying to access memory in appropriate manner
and kernel sends SIGSEGV
I also got the core dump. I am trying gdb
first time so I am not sure whether I am
using it correctly
gdb /usr/sbin/glusterfs core.138536
It just tell me that program terminated with
signal 11, segmentation fault .
The problem is not limited to one client but
happening to many clients.
I will really appreciate any help as whole
file system has become unusable
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind
Post by Milind Changire
Kashif,
$ gluster volume set <vol>
diagnostics.brick-log-level TRACE
$ gluster volume set <vol>
diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the
log-level to DEBUG instead of TRACE.
On Tue, Jun 12, 2018 at 3:37 PM, mohammad
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins !
The server log at
/var/log/glusterfs/bricks/glusteratlas-
brics001-gv0.log have this line only
115013] [server-
helpers.c:289:do_fd_cleanup] 0-
atlasglust-server: fd cleanup on
/atlas/atlasdata/zgubic/hmumu/histograms/
v14.3/Signal
101055] [client_t.c:443:gf_client_unref]
0-atlasglust-server: Shutting down
connection <server-name> -2224879-
2018/06/12-09:51:01:460889-atlasglust-
client-0-0-0
There is no other information. Is there
any way to increase log verbosity?
on the client
114057] [client-
handshake.c:1478:select_server_supported_
programs] 0-atlasglust-client-5: Using
Program GlusterFS 3.3, Num (1298437),
Version (330)
114046] [client-
handshake.c:1231:client_setvolume_cbk] 0-
atlasglust-client-5: Connected to
atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
114047] [client-
handshake.c:1242:client_setvolume_cbk] 0-
atlasglust-client-5: Server and Client
lk-version numbers are not same,
reopening the fds
114035] [client-
handshake.c:202:client_set_lk_version_cbk
] 0-atlasglust-client-5: Server lk
version = 1
114057] [client-
handshake.c:1478:select_server_supported_
programs] 0-atlasglust-client-6: Using
Program GlusterFS 3.3, Num (1298437),
Version (330)
114046] [client-
handshake.c:1231:client_setvolume_cbk] 0-
atlasglust-client-6: Connected to
atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
114047] [client-
handshake.c:1242:client_setvolume_cbk] 0-
atlasglust-client-6: Server and Client
lk-version numbers are not same,
reopening the fds
114035] [client-
handshake.c:202:client_set_lk_version_cbk
] 0-atlasglust-client-6: Server lk
version = 1
[2018-06-12 09:51:01.752207] I [fuse-
bridge.c:4205:fuse_init] 0-glusterfs-
glusterfs 7.24 kernel 7.14
[2018-06-12 09:51:01.752261] I [fuse-
switched to graph 0
is there a problem with server and client
1k version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay
On Mon, Jun 11, 2018 at 8:50 AM,
Post by mohammad kashif
Hi
Since I have updated our gluster
server and client to latest version
3.12.9-1, I am having this issue of
gluster getting unmounted from client
very regularly. It was not a problem
before update.
Its a distributed file system with no
replication. We have seven servers
totaling around 480TB data. Its 97%
full.
I am using following config on server
gluster volume set atlasglust
features.cache-invalidation on
gluster volume set atlasglust
features.cache-invalidation-timeout
600
gluster volume set atlasglust
performance.stat-prefetch on
gluster volume set atlasglust
performance.cache-invalidation on
gluster volume set atlasglust
performance.md-cache-timeout 600
gluster volume set atlasglust
performance.parallel-readdir on
gluster volume set atlasglust
performance.cache-size 1GB
gluster volume set atlasglust
performance.client-io-threads on
gluster volume set atlasglust
cluster.lookup-optimize on
gluster volume set atlasglust
performance.stat-prefetch on
gluster volume set atlasglust
client.event-threads 4
gluster volume set atlasglust
server.event-threads 4
clients are mounted with this option
defaults,direct-io-
mode=disable,attribute-
timeout=600,entry-
timeout=600,negative-
timeout=600,fopen-keep-
cache,rw,_netdev
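[For reference, these options belong on a glusterfs fstab entry; a minimal
sketch of assembling such a line follows. The server hostname and mount
point are placeholders of my own, not values from this thread.]

```shell
#!/bin/sh
# Sketch: assemble the fstab line for a GlusterFS FUSE mount using the
# options quoted above. SERVER and MNT are placeholders (assumptions).
SERVER="gluster01.example.org"
VOLUME="atlasglust"
MNT="/mnt/atlasglust"
OPTS="defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600"
OPTS="$OPTS,negative-timeout=600,fopen-keep-cache,rw,_netdev"

# /etc/fstab format: <server>:/<volume> <mountpoint> glusterfs <options> 0 0
FSTAB_LINE="$SERVER:/$VOLUME $MNT glusterfs $OPTS 0 0"
echo "$FSTAB_LINE"
```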
I can't see anything in the log file.
Can someone suggest how to troubleshoot this issue?
Can you please share the log file?
Checking for messages related to
disconnections/crashes in the log file
would be a good way to start
troubleshooting the problem.
Thanks,
Vijay
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
James P. Kinney III
Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain
http://heretothereideas.blogspot.com/
Nithya Balachandran
2018-06-15 08:15:48 UTC
Permalink
Hi Mohammad,
I was unable to reproduce this on a volume created on a system running
3.12.9.
Can you send me the FUSE volfiles for the volume atlasglust? They will be
in /var/lib/glusterd/vols/atlasglust/ on any of the gluster servers
hosting the volume and called *.tcp-fuse.vol.
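[To gather those files in one go, something like this should work on one of
the servers — a sketch; the tarball path is my own choice:]

```shell
#!/bin/sh
# Sketch: bundle the *.tcp-fuse.vol client volfiles from a gluster server.
# The volfile directory is the one named above; the output path is an
# assumption.
collect_volfiles() {
    voldir=$1; out=$2
    [ -d "$voldir" ] || { echo "no $voldir on this host" >&2; return 1; }
    # pack only the FUSE (tcp) client volfiles
    ( cd "$voldir" && tar -czf "$out" ./*.tcp-fuse.vol )
}

collect_volfiles /var/lib/glusterd/vols/atlasglust \
    /tmp/atlasglust-volfiles.tar.gz || true
```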
Thanks,
Nithya
Post by mohammad kashif
Hi Nithya
It seems that the problem can be solved either by turning parallel-readdir
off or by downgrading the client to 3.10.12-1. Yesterday I downgraded some
clients to 3.10.12-1 and it seems to have fixed the problem. Today when I saw
your email I turned parallel-readdir off and the current 3.12.9-1 client
started to work. I upgraded the server and clients to 3.12.9-1 last month
and since then clients were intermittently unmounting about once a week. But
during the last three days, it started unmounting every few minutes. I don't
know what triggered this sudden panic except that the file system was
quite full; around 98%. It is a 480 TB file system with almost 80 million
files.
Servers have 64GB RAM and clients have 64GB to 192GB RAM. I tested with a
192GB RAM client and it still had the same issue.
Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 7
Transport-type: tcp
Brick1: pplxgluster01.X.Y.Z:/glusteratlas/brick001/gv0
Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
performance.cache-invalidation: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.client-io-threads: on
performance.cache-size: 1GB
performance.parallel-readdir: off
performance.md-cache-timeout: 600
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
auth.allow: X.Y.Z.*
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
Thanks
Kashif
Post by Nithya Balachandran
+Poornima who works on parallel-readdir.
@Poornima, Have you seen anything like this before?
Post by Nithya Balachandran
This is not the same issue as the one you are referring - that was in
the RPC layer and caused the bricks to crash. This one is different as it
seems to be in the dht and rda layers. It does look like a stack overflow
though.
@Mohammad,
1. gluster volume info
2. The number of entries in the directory being listed
3. System memory
Does this still happen if you turn off parallel-readdir?
Regards,
Nithya
Post by Milind Changire
+Nithya
Nithya,
Do these logs [1] look similar to the recursive readdir() issue that
you encountered just a while back ?
i.e. recursive readdir() response definition in the XDR
[1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
Post by mohammad kashif
Hi Milind
Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here:
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand but still not able to make sense out of it.
Thanks
Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/
repo. Do you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
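[Milind's debuginfo link can be wired up as a yum repo; a sketch follows.
The repo id and file name are my own choice, the baseurl is the one from
his reply; the file is written to /tmp so it can be inspected before use.]

```shell
#!/bin/sh
# Sketch: define a yum repo for the CentOS Storage SIG debuginfo packages
# so debuginfo-install can resolve glusterfs-debuginfo. The repo id/name
# are assumptions; the baseurl is the one posted in the thread.
cat > /tmp/storage-sig-debuginfo.repo <<'EOF'
[centos-storage-debuginfo]
name=CentOS-6 Storage SIG - debuginfo
baseurl=http://debuginfo.centos.org/centos/6/storage/x86_64/
enabled=1
gpgcheck=0
EOF

# Then, as root:
#   cp /tmp/storage-sig-debuginfo.repo /etc/yum.repos.d/
#   debuginfo-install glusterfs-fuse-3.12.9-1.el6.x86_64
echo "repo file written to /tmp/storage-sig-debuginfo.repo"
```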
On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps at client and there is no glusterd
process running on client.
Kashif
On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you ran
gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on
RHEL6. The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the
lines following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you
are running gluster.
If you've installed the glusterfs-debuginfo package, you'll
also get the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced glusterd
on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As far as I understand, gluster is somehow trying to access
memory in an inappropriate manner and the kernel sends SIGSEGV.
I also got the core dump. I am trying gdb for the first time so I am
not sure whether I am using it correctly:
gdb /usr/sbin/glusterfs core.138536
It just tells me that the program terminated with signal 11,
segmentation fault.
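[gdb can be pushed further than that one-line summary; a sketch of a batch
script that dumps full backtraces from the core. The core file name is the
one from this message; the gdb commands are standard batch-mode usage.]

```shell
#!/bin/sh
# Sketch: a gdb batch script to extract full backtraces from the core,
# instead of just the "terminated with signal 11" summary.
cat > /tmp/bt.gdb <<'EOF'
set pagination off
info threads
thread apply all bt full
quit
EOF

# Run it against the binary and the core from this message:
#   gdb -batch -x /tmp/bt.gdb /usr/sbin/glusterfs core.138536 > backtrace.log 2>&1
echo "gdb script written to /tmp/bt.gdb"
```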
The problem is not limited to one client but is happening on many
clients. I will really appreciate any help as the whole file system has
become unusable.
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to DEBUG
instead of TRACE.
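[A small helper for flipping both levels together, and back afterwards —
a sketch; per the volume info later in this thread, this volume normally
runs at ERROR.]

```shell
#!/bin/sh
# Sketch: set brick and client log levels together, per the suggestion
# above, so the two don't drift apart.
set_log_level() {
    vol=$1; level=$2
    gluster volume set "$vol" diagnostics.brick-log-level "$level"
    gluster volume set "$vol" diagnostics.client-log-level "$level"
}

# Usage (on a gluster server):
#   set_log_level atlasglust TRACE   # while reproducing the unmounts
#   set_log_level atlasglust ERROR   # restore the usual level afterwards
```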
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins !
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
have this line only
[2018-06-12 09:53:19.303102] I [MSGID: 115013]
fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
Shutting down connection <server-name>-2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
There is no other information. Is there any way to increase
log verbosity?
on the client
[2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I
[fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol
versions: glusterfs 7.24 kernel 7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync]
0-fuse: switched to graph 0
Is there a problem with the server and client lk version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
Post by mohammad kashif
Hi
Since I have updated our gluster server and client to
latest version 3.12.9-1, I am having this issue of gluster getting
unmounted from client very regularly. It was not a problem before update.
It's a distributed file system with no replication. We have
seven servers totaling around 480TB of data. It's 97% full.
I am using following config on server
gluster volume set atlasglust features.cache-invalidation
on
gluster volume set atlasglust
features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust
performance.cache-invalidation on
gluster volume set atlasglust performance.md-cache-timeout
600
gluster volume set atlasglust performance.parallel-readdir
on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust
performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4
clients are mounted with this option
defaults,direct-io-mode=disabl
e,attribute-timeout=600,entry-
timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
I can't see anything in the log file. Can someone suggest
how to troubleshoot this issue?
Can you please share the log file? Checking for messages
related to disconnections/crashes in the log file would be a good way to
start troubleshooting the problem.
Thanks,
Vijay
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Milind
--
Milind
--
Milind
--
Milind
--
Milind
Nithya Balachandran
2018-06-15 08:19:40 UTC
Permalink
Post by Nithya Balachandran
Hi Mohammad,
I was unable to reproduce this on a volume created on a system running
3.12.9.
Can you send me the FUSE volfiles for the volume atlasglust? They will be
in /var/lib/glusterd/vols/atlasglust/ on any of the gluster servers
hosting the volume and called *.tcp-fuse.vol.
Can you also send the same files after enabling parallel-readdir?
Post by Nithya Balachandran
Thanks,
Nithya
Post by mohammad kashif
Hi Nithya
It seems that the problem can be solved either by turning parallel-readdir
off or by downgrading the client to 3.10.12-1. Yesterday I downgraded some
clients to 3.10.12-1 and it seems to have fixed the problem. Today when I saw
your email I turned parallel-readdir off and the current 3.12.9-1 client
started to work. I upgraded the server and clients to 3.12.9-1 last month
and since then clients were intermittently unmounting about once a week. But
during the last three days, it started unmounting every few minutes. I don't
know what triggered this sudden panic except that the file system was
quite full; around 98%. It is a 480 TB file system with almost 80 million
files.
Servers have 64GB RAM and clients have 64GB to 192GB RAM. I tested with a
192GB RAM client and it still had the same issue.
Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 7
Transport-type: tcp
Brick1: pplxgluster01.X.Y.Z:/glusteratlas/brick001/gv0
Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
performance.cache-invalidation: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.client-io-threads: on
performance.cache-size: 1GB
performance.parallel-readdir: off
performance.md-cache-timeout: 600
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
auth.allow: X.Y.Z.*
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
Thanks
Kashif
Post by Nithya Balachandran
+Poornima who works on parallel-readdir.
@Poornima, Have you seen anything like this before?
Post by Nithya Balachandran
This is not the same issue as the one you are referring - that was in
the RPC layer and caused the bricks to crash. This one is different as it
seems to be in the dht and rda layers. It does look like a stack overflow
though.
@Mohammad,
1. gluster volume info
2. The number of entries in the directory being listed
3. System memory
Does this still happen if you turn off parallel-readdir?
Regards,
Nithya
Post by Milind Changire
+Nithya
Nithya,
Do these logs [1] look similar to the recursive readdir() issue that
you encountered just a while back ?
i.e. recursive readdir() response definition in the XDR
[1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here:
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand but still not able to make sense out of it.
Thanks
Kashif
On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/
repo. Do you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps at client and there is no glusterd
process running on client.
Kashif
On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you
ran gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on
RHEL6. The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the
lines following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which
you are running gluster.
If you've installed the glusterfs-debuginfo package, you'll
also get the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced
glusterd on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As far as I understand, gluster is somehow trying to access
memory in an inappropriate manner and the kernel sends SIGSEGV.
I also got the core dump. I am trying gdb for the first time so I am
not sure whether I am using it correctly:
gdb /usr/sbin/glusterfs core.138536
It just tells me that the program terminated with signal 11,
segmentation fault.
The problem is not limited to one client but is happening on many
clients. I will really appreciate any help as the whole file system has
become unusable.
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to DEBUG
instead of TRACE.
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins !
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
have this line only
[2018-06-12 09:53:19.303102] I [MSGID: 115013]
fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
Shutting down connection <server-name>-2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
There is no other information. Is there any way to increase
log verbosity?
on the client
[2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I
[fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol
versions: glusterfs 7.24 kernel 7.14
[2018-06-12 09:51:01.752261] I
[fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to
graph 0
Is there a problem with the server and client lk version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
Post by mohammad kashif
Hi
Since I have updated our gluster server and client to
latest version 3.12.9-1, I am having this issue of gluster getting
unmounted from client very regularly. It was not a problem before update.
It's a distributed file system with no replication. We
have seven servers totaling around 480TB of data. It's 97% full.
I am using following config on server
gluster volume set atlasglust features.cache-invalidation
on
gluster volume set atlasglust
features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust
performance.cache-invalidation on
gluster volume set atlasglust
performance.md-cache-timeout 600
gluster volume set atlasglust
performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust
performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4
clients are mounted with this option
defaults,direct-io-mode=disabl
e,attribute-timeout=600,entry-
timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
I can't see anything in the log file. Can someone suggest
how to troubleshoot this issue?
Can you please share the log file? Checking for messages
related to disconnections/crashes in the log file would be a good way to
start troubleshooting the problem.
Thanks,
Vijay
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Milind
--
Milind
--
Milind
--
Milind
--
Milind
mohammad kashif
2018-06-17 22:49:56 UTC
Permalink
Hi Nithya
The FUSE volfile is here, after disabling parallel-readdir:
http://www-pnp.physics.ox.ac.uk/~mohammad/atlasglust.tcp-fuse.vol

Unfortunately I can't take the risk of enabling parallel-readdir as the
cluster is in heavy use and it is likely to kill many jobs if clients
unmount again.
There is one thing which I haven't mentioned earlier, to keep things
simpler. I have another 300TB gluster cluster which is only 70% full and
has far fewer files. It has parallel-readdir enabled and some of the
clients are common to both. But I haven't had any problem with that
cluster.
I suspect that the problem was triggered when atlasglust became more than
98% full or crossed a certain number of files. But upgrading clients to
3.12.9 was definitely a factor, as this particular problem started after
that. Rolling back some clients to 3.10 while keeping parallel-readdir
enabled also fixed the problem.
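[Since the fill level is a suspect, a quick sketch for watching both block
and inode usage on a mount; the mount point is a placeholder. With ~80
million files, inode exhaustion on the bricks can bite before block
exhaustion does.]

```shell
#!/bin/sh
# Sketch: report block and inode usage for a mount point. MOUNT is a
# placeholder (assumption), not a path from this thread.
MOUNT=${1:-/}
used_pct=$(df -P "$MOUNT" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
inode_pct=$(df -Pi "$MOUNT" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
echo "block usage: ${used_pct}%  inode usage: ${inode_pct}%"
if [ "${used_pct:-0}" -ge 95 ]; then
    echo "warning: over 95% full"
fi
```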
Thanks
Kashif
Post by Nithya Balachandran
Post by Nithya Balachandran
Hi Mohammad,
I was unable to reproduce this on a volume created on a system running
3.12.9.
Can you send me the FUSE volfiles for the volume atlasglust? They will
be in /var/lib/glusterd/vols/atlasglust/ on any of the gluster servers
hosting the volume and called *.tcp-fuse.vol.
Can you also send the same files after enabling parallel-readdir?
Post by Nithya Balachandran
Thanks,
Nithya
Post by mohammad kashif
Hi Nithya
It seems that the problem can be solved either by turning parallel-readdir
off or by downgrading the client to 3.10.12-1. Yesterday I downgraded some
clients to 3.10.12-1 and it seems to have fixed the problem. Today when I
saw your email I turned parallel-readdir off and the current 3.12.9-1
client started to work. I upgraded the server and clients to 3.12.9-1 last
month and since then clients were intermittently unmounting about once a
week. But during the last three days, it started unmounting every few
minutes. I don't know what triggered this sudden panic except that the file
system was quite full; around 98%. It is a 480 TB file system with almost
80 million files.
Servers have 64GB RAM and clients have 64GB to 192GB RAM. I tested with a
192GB RAM client and it still had the same issue.
Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 7
Transport-type: tcp
Brick1: pplxgluster01.X.Y.Z:/glusteratlas/brick001/gv0
Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
performance.cache-invalidation: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.client-io-threads: on
performance.cache-size: 1GB
performance.parallel-readdir: off
performance.md-cache-timeout: 600
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
auth.allow: X.Y.Z.*
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
Thanks
Kashif
On Thu, Jun 14, 2018 at 5:39 AM, Nithya Balachandran <
Post by Nithya Balachandran
+Poornima who works on parallel-readdir.
@Poornima, Have you seen anything like this before?
Post by Nithya Balachandran
This is not the same issue as the one you are referring - that was in
the RPC layer and caused the bricks to crash. This one is different as it
seems to be in the dht and rda layers. It does look like a stack overflow
though.
@Mohammad,
1. gluster volume info
2. The number of entries in the directory being listed
3. System memory
Does this still happen if you turn off parallel-readdir?
Regards,
Nithya
Post by Milind Changire
+Nithya
Nithya,
Do these logs [1] look similar to the recursive readdir() issue that
you encountered just a while back ?
i.e. recursive readdir() response definition in the XDR
[1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here:
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand but still not able to make sense out of it.
Thanks
Kashif
On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/
repo. Do you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps at client and there is no glusterd
process running on client.
Kashif
On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you
ran gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on
RHEL6. The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the
lines following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something
similar
Also, let me know the CPU arch and OS Distribution on which
you are running gluster.
If you've installed the glusterfs-debuginfo package, you'll
also get the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced
glusterd on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As far as I understand, gluster is somehow trying to access
memory in an inappropriate manner and the kernel sends SIGSEGV.
I also got the core dump. I am trying gdb for the first time so I am
not sure whether I am using it correctly:
gdb /usr/sbin/glusterfs core.138536
It just tells me that the program terminated with signal 11,
segmentation fault.
The problem is not limited to one client but is happening on many
clients. I will really appreciate any help as the whole file system has
become unusable.
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to
DEBUG instead of TRACE.
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins !
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
have this line only
[2018-06-12 09:53:19.303102] I [MSGID: 115013]
fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
Shutting down connection <server-name>-2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
There is no other information. Is there any way to
increase log verbosity?
on the client
[2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I
[fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol
versions: glusterfs 7.24 kernel 7.14
[2018-06-12 09:51:01.752261] I
[fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to
graph 0
Is there a problem with the server and client lk-version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
Post by mohammad kashif
Hi
Since I have updated our gluster server and client to
latest version 3.12.9-1, I am having this issue of gluster getting
unmounted from client very regularly. It was not a problem before update.
Its a distributed file system with no replication. We
have seven servers totaling around 480TB data. Its 97% full.
I am using following config on server
gluster volume set atlasglust
features.cache-invalidation on
gluster volume set atlasglust
features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch
on
gluster volume set atlasglust
performance.cache-invalidation on
gluster volume set atlasglust
performance.md-cache-timeout 600
gluster volume set atlasglust
performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust
performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch
on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4
clients are mounted with this option
defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
I can't see anything in the log file. Can someone
suggest that how to troubleshoot this issue?
Can you please share the log file? Checking for messages
related to disconnections/crashes in the log file would be a good way to
start troubleshooting the problem.
Thanks,
Vijay
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Milind
Raghavendra Gowdappa
2018-06-18 02:41:48 UTC
Permalink
From the bt:

#8 0x00007f6ef977e6de in rda_readdirp (frame=0x7f6eec862320,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=357, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#9 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#10 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862210,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#11 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#12 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862100,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#13 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#14 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ff0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#15 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#16 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ee0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#17 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#18 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861dd0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#19 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#20 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861cc0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#21 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#22 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861bb0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#23 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388

It looks like an infinite recursion. Note that readdirp is wound to the
same subvol (the value of "this" is the same in all calls to rda_readdirp)
at the same offset (value 2). This may be a bug in DHT (winding down
readdirp with a wrong offset) or in readdir-ahead (populating incorrect
offset values in the dentries it returns as the readdirp response).
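The failure mode can be sketched in a few lines (hypothetical names, not GlusterFS code): if the layer that resumes a paginated directory read hands back a stale offset instead of the offset after the last entry it served, every continuation re-reads the same window. In the backtrace the continuation is a nested call (dht_readdirp_cbk and rda_readdirp calling each other with off=2 in every frame), so the loop is recursive rather than iterative and eventually overflows the stack.

```python
# Minimal model of a paginated directory read (hypothetical, not GlusterFS code).
# Each call returns up to `count` entries starting at `offset`, plus the offset
# at which the next call should resume.

DIRECTORY = ["a", "b", "c", "d"]

def readdirp(offset, count=2):
    window = DIRECTORY[offset:offset + count]
    return window, offset + len(window)   # correct: resume offset advances

def readdirp_buggy(offset, count=2):
    window = DIRECTORY[offset:offset + count]
    return window, offset                 # bug: resume offset never advances

def read_all(fn, max_calls=10):
    """Drain the directory; raise if the walk never terminates."""
    entries, offset = [], 0
    for calls in range(1, max_calls + 1):
        window, offset = fn(offset)
        if not window:                    # empty window means end of directory
            return entries, calls
        entries.extend(window)
    raise RuntimeError("offset never advanced: unbounded readdir loop")

print(read_all(readdirp))                 # → (['a', 'b', 'c', 'd'], 3)
```

With `readdirp_buggy`, `read_all` keeps receiving the same first window and raises; in gluster each "retry" is a fresh stack frame instead, which matches the repeating frames and the SIGSEGV.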
Post by mohammad kashif
Hi Milind
Thanks a lot, I manage to run gdb and produced traceback as well. Its here
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand but still not able to make sense out of it.
Thanks
Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo. Do
you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps at client and there is no glusterd process
running on client.
Kashif
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you ran
gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on RHEL6.
The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the lines
following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you are
running gluster.
If you've installed the glusterfs-debuginfo package, you'll also
get the source lines in the backtrace via gdb
Raghavendra Gowdappa
2018-06-18 03:46:49 UTC
Permalink
Post by Raghavendra Gowdappa
It looks like an infinite recursion. Note that readdirp is wound to the
same subvol (value of "this" is same in all calls to rda_readdirp) at the
same offset (of value 2). This may be a bug in DHT (winding down readdirp
with wrong offset) or in readdir-ahead (populating incorrect offset values
in dentries it returns as readdirp response).
There has been quite a bit of code change in readdir-ahead and dht-readdirp
between the good and bad release. @Poornima, can you check for anything
relevant in readdir-ahead, while I check for anything interesting in
dht-readdirp?
Raghavendra Gowdappa
2018-06-18 04:09:17 UTC
Permalink
Post by Raghavendra Gowdappa
It looks like an infinite recursion. Note that readdirp is wound to the
same subvol (value of "this" is same in all calls to rda_readdirp) at the
same offset (of value 2). This may be a bug in DHT (winding down readdirp
with wrong offset) or in readdir-ahead (populating incorrect offset values
in dentries it returns as readdirp response).
It looks to be memory corruption. The value of the size argument in
rda_readdirp is too big (around 127 TB) to be sane. If you have a
reproducer, please run it under valgrind or ASAN.

To make it explicit: at the moment it is not clear that there is a bug in
readdir-ahead or DHT, as it looks to be memory corruption. Till I get a
reproducer, or valgrind/ASAN output of the client process when the issue
occurs, I won't be working on this problem.
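Two quick checks on that size value (my own arithmetic, not from the thread): 140114606084288 bytes is about 127 TiB, matching the estimate above, and interpreted as an address it falls in the same 0x7f6e... range as the code addresses in the backtrace (e.g. dht_readdirp_cbk at 0x7f6ef952db4c), which fits the theory that stack garbage, possibly a pointer, was read where a size was expected.

```python
# The suspicious size= value from the rda_readdirp frames in the backtrace.
size = 140114606084288

# Roughly 127 TiB -- far too large to be a sane readdirp request size.
print(size / 2**40)   # → 127.43...

# Viewed as an address, the value lands in the 0x7f6e... range where the
# backtrace shows gluster's code mapped, so it looks like a pointer (or
# other stack garbage) passed where a size was expected.
print(hex(size))      # → 0x7f6ef952d0c0
assert 0x7f6e00000000 < size < 0x7f6f00000000
```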
Post by Raghavendra Gowdappa
Post by mohammad kashif
Hi Milind
Thanks a lot, I manage to run gdb and produced traceback as well. Its here
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand but still not able to make sense out of it.
Thanks
Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo.
Do you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find a debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps on the client and there is no glusterd
process running on the client.
Kashif
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you ran
gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on RHEL6.
The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the lines
following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you
are running gluster.
If you've installed the glusterfs-debuginfo package, you'll also
get the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced glusterd on the client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As far as I understand, gluster is somehow trying to access
memory in an inappropriate manner and the kernel sends SIGSEGV
I also got the core dump. I am trying gdb for the first time so I am not
sure whether I am using it correctly
gdb /usr/sbin/glusterfs core.138536
It just tells me that the program terminated with signal 11,
segmentation fault.
The problem is not limited to one client but is happening on many clients.
I will really appreciate any help as the whole file system has become
unusable
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to DEBUG
instead of TRACE.
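For completeness, once debugging is done the log levels can be reset to
their defaults (a sketch using the volume name from this thread; the
commands are echoed rather than executed):

```shell
# Sketch: restore default diagnostics log levels after debugging.
VOL=atlasglust
echo gluster volume reset "$VOL" diagnostics.brick-log-level
echo gluster volume reset "$VOL" diagnostics.client-log-level
```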
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins!
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
has only this line
[2018-06-12 09:53:19.303102] I [MSGID: 115013]
[server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd
cleanup on /atlas/atlasdata/zgubic/hmumu/
histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting
down connection <server-name> -2224879-2018/06/12-09:51:01:4
60889-atlasglust-client-0-0-0
There is no other information. Is there any way to increase log
verbosity?
on the client
[2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync]
0-fuse: switched to graph 0
is there a problem with the server and client lk-version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
Post by mohammad kashif
Hi
Since I have updated our gluster server and client to latest
version 3.12.9-1, I am having this issue of gluster getting unmounted from
client very regularly. It was not a problem before update.
Its a distributed file system with no replication. We have
seven servers totaling around 480TB data. Its 97% full.
I am using following config on server
gluster volume set atlasglust features.cache-invalidation on
gluster volume set atlasglust features.cache-invalidation-timeout
600
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust performance.cache-invalidation on
gluster volume set atlasglust performance.md-cache-timeout 600
gluster volume set atlasglust performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4
clients are mounted with this option
defaults,direct-io-mode=disable,attribute-timeout=600,entry-
timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
I can't see anything in the log file. Can someone suggest
that how to troubleshoot this issue?
Can you please share the log file? Checking for messages
related to disconnections/crashes in the log file would be a good way to
start troubleshooting the problem.
Thanks,
Vijay
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Milind
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Raghavendra Gowdappa
2018-06-18 05:27:58 UTC
Permalink
Post by Raghavendra Gowdappa
Post by Raghavendra Gowdappa
#8 0x00007f6ef977e6de in rda_readdirp (frame=0x7f6eec862320,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=357, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#9 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#10 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862210,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#11 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#12 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862100,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#13 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#14 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ff0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#15 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#16 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ee0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#17 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#18 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861dd0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#19 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#20 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861cc0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#21 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#22 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861bb0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#23 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
It looks like an infinite recursion. Note that readdirp is wound to the
same subvol (value of "this" is same in all calls to rda_readdirp) at the
same offset (of value 2). This may be a bug in DHT (winding down readdirp
with wrong offset) or in readdir-ahead (populating incorrect offset values
in dentries it returns as readdirp response).
It looks to be a corruption. Value of size argument in rda_readdirp is too
big (around 127 TB) to be sane. If you've a reproducer, please run it in
valgrind or ASAN.
I spoke too early. It could be a negative value and hence it may not be a
corruption. Is it possible to upload the core somewhere? Or better still,
access to a gdb session with this core would be more helpful.
mohammad kashif
2018-06-18 08:39:20 UTC
Permalink
Hi

The problem appeared again after a few days. This time, the client
is glusterfs-3.10.12-1.el6.x86_64 and performance.parallel-readdir is off.
The log level was set to ERROR and I got this log at the time of the crash

[2018-06-14 08:45:43.551384] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x7fac2e66ce03] (-->
/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fac2e434867] (-->
/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fac2e43497e] (-->
/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x7fac2e434a45]
(--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x7fac2e434d68] )))))
0-atlasglust-client-4: forced unwinding frame type(GlusterFS 3.3)
op(READDIRP(40)) called at 2018-06-14 08:45:43.483303 (xid=0x7553c7

Core dumps were enabled on the client so a dump was created. It is here

http://www-pnp.physics.ox.ac.uk/~mohammad/core.1002074
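For anyone reproducing this, enabling client core dumps on an EL6-era system
looks roughly like the following (a sketch; the core_pattern path is a
placeholder, and the sysctl line is printed rather than applied since it
needs root):

```shell
# Sketch: enable core dumps for the gluster client process.
ulimit -c unlimited   # lift the per-process core-size limit

# Name cores by executable and pid under a dedicated directory
# (placeholder path); apply with: sysctl -w "$PATTERN"
PATTERN='kernel.core_pattern=/var/crash/core.%e.%p'
echo "$PATTERN"
```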

I produced a gdb backtrace using this command

gdb /usr/sbin/glusterfs core.1002074 -ex bt -ex quit |& tee
backtrace.log_18_16_1


http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log_18_16_1


I haven't used gdb much, so let me know if you want me to run gdb in a
different manner.
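If it helps, a fuller batch run over the same core would also capture
per-thread state. A sketch of a gdb command file (save it as, say,
bt-all.gdb — a made-up name — and run
gdb -batch -x bt-all.gdb /usr/sbin/glusterfs core.1002074):

```
set pagination off
thread apply all bt full   # backtrace with local variables for every thread
info registers             # register state of the faulting thread
```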

Thanks

Kashif
Post by Raghavendra Gowdappa
On Mon, Jun 18, 2018 at 8:11 AM, Raghavendra Gowdappa <
Post by Raghavendra Gowdappa
#8 0x00007f6ef977e6de in rda_readdirp (frame=0x7f6eec862320,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=357, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#9 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#10 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862210,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#11 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#12 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862100,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#13 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#14 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ff0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#15 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#16 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ee0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#17 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#18 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861dd0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#19 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#20 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861cc0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#21 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#22 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861bb0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#23 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
It looks like an infinite recursion. Note that readdirp is wound to the
same subvol (value of "this" is same in all calls to rda_readdirp) at the
same offset (of value 2). This may be a bug in DHT (winding down readdirp
with wrong offset) or in readdir-ahead (populating incorrect offset values
in dentries it returns as readdirp response).
It looks to be a corruption. Value of size argument in rda_readdirp is
too big (around 127 TB) to be sane. If you've a reproducer, please run it
in valgrind or ASAN.
I spoke too early. It could be a negative value and hence it may not be a
corruption. Is it possible to upload the core somewhere? Or better still
access to gdb session with this core would be more helpful.
To make it explicit, ATM its not clear that there is bug in readdir-ahead
or DHT as it looks to be a memory corruption. Till I get a reproducer or
valgrind/ASAN output of client process when the issue occcurs, I won't be
working on this problem.
Post by Raghavendra Gowdappa
Post by mohammad kashif
Hi Milind
Thanks a lot, I manage to run gdb and produced traceback as well. Its here
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand but still not able to make sense out of it.
Thanks
Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo.
Do you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps at client and there is no glusterd
process running on client.
Kashif
On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you ran
gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on
RHEL6. The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the lines
following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you
are running gluster.
If you've installed the glusterfs-debuginfo package, you'll also
get the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced glusterd
on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As for I understand that somehow gluster is trying to access
memory in appropriate manner and kernel sends SIGSEGV
I also got the core dump. I am trying gdb first time so I am
not sure whether I am using it correctly
gdb /usr/sbin/glusterfs core.138536
It just tell me that program terminated with signal 11,
segmentation fault .
The problem is not limited to one client but happening to many
clients.
I will really appreciate any help as whole file system has
become unusable
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to DEBUG
instead of TRACE.
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins !
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
have this line only
2018-06-12 09:53:19.303102] I [MSGID: 115013]
[server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd
cleanup on /atlas/atlasdata/zgubic/hmumu/
histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
Shutting down connection <server-name> -2224879-2018/06/12-09:51:01:4
60889-atlasglust-client-0-0-0
There is no other information. Is there any way to increase
log verbosity?
on the client
2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync]
0-fuse: switched to graph 0
is there a problem with server and client 1k version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
Post by mohammad kashif
Hi
Since I have updated our gluster server and client to
latest version 3.12.9-1, I am having this issue of gluster getting
unmounted from client very regularly. It was not a problem before update.
Its a distributed file system with no replication. We have
seven servers totaling around 480TB data. Its 97% full.
I am using following config on server
gluster volume set atlasglust features.cache-invalidation on
gluster volume set atlasglust features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust performance.cache-invalidation on
gluster volume set atlasglust performance.md-cache-timeout 600
gluster volume set atlasglust performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4
clients are mounted with this option
defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
I can't see anything in the log file. Can someone suggest
that how to troubleshoot this issue?
Can you please share the log file? Checking for messages
related to disconnections/crashes in the log file would be a good way to
start troubleshooting the problem.
Thanks,
Vijay
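[Editorial note: Vijay's suggestion can be scripted as a small sketch. The grep patterns and the log path are assumptions, not an official tool; FUSE client logs live under /var/log/glusterfs/ and are named after the mount point.]

```shell
# Sketch: surface likely disconnect/crash evidence in a glusterfs client log.
# " E [" / " C [" mark Error/Critical messages; "signal received" and "crash"
# open the backtrace that glusterfs prints when it dies.
scan_gluster_log() {
    grep -E ' [EC] \[|crash|signal received|disconnect' "$1"
}
# Usage (path is an example):
#   scan_gluster_log /var/log/glusterfs/mnt-atlasglust.log
```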
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Milind
Nithya Balachandran
2018-06-20 11:28:00 UTC
Permalink
Hi Mohammad,

This is a different crash. How often does it happen?


We have managed to reproduce the first crash you reported and a bug has
been filed at [1].
We will work on a fix for this.


Regards,
Nithya

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1593199
Post by mohammad kashif
Hi
The problem appeared again after a few days. This time, the client
is glusterfs-3.10.12-1.el6.x86_64 and performance.parallel-readdir is
off. The log level was set to ERROR, and I got this log at the time of the crash:
[2018-06-14 08:45:43.551384] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x7fac2e66ce03]
(--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fac2e434867]
(--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fac2e43497e]
(--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x7fac2e434a45]
(--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x7fac2e434d68]
))))) 0-atlasglust-client-4: forced unwinding frame type(GlusterFS 3.3)
op(READDIRP(40)) called at 2018-06-14 08:45:43.483303 (xid=0x7553c7
Core dump was enabled on client so it created a dump. It is here
http://www-pnp.physics.ox.ac.uk/~mohammad/core.1002074
I used a gdb trace using this command
gdb /usr/sbin/glusterfs core.1002074 -ex bt -ex quit |& tee
backtrace.log_18_16_1
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log_18_16_1
I haven't used gdb much so let me know if you want me to run gdb in
different manner.
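[Editorial note: if a fuller trace is wanted, gdb can batch several commands in one run. A sketch only, reusing the core file name from this message; glusterfs-debuginfo must be installed for source lines to resolve, and |& is bash syntax.]

```shell
# All threads, with local variables, unpaginated, saved to a file.
gdb /usr/sbin/glusterfs core.1002074 \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    -ex quit |& tee backtrace_full.log
```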
Thanks
Kashif
On Mon, Jun 18, 2018 at 9:39 AM, Raghavendra Gowdappa <
On Mon, Jun 18, 2018 at 8:11 AM, Raghavendra Gowdappa <
Post by Raghavendra Gowdappa
#8 0x00007f6ef977e6de in rda_readdirp (frame=0x7f6eec862320,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=357, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#9 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#10 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862210,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#11 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#12 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862100,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#13 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#14 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ff0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#15 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#16 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ee0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#17 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#18 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861dd0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#19 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#20 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861cc0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#21 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#22 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861bb0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#23 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
It looks like an infinite recursion. Note that readdirp is wound to the
same subvol (value of "this" is same in all calls to rda_readdirp) at the
same offset (of value 2). This may be a bug in DHT (winding down readdirp
with wrong offset) or in readdir-ahead (populating incorrect offset values
in dentries it returns as readdirp response).
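[Editorial note: the pattern described above can be modeled with a toy sketch. The function names are borrowed from the backtrace purely for illustration: two functions wind into each other with an unchanging offset, so the chain never terminates and each hop consumes a stack frame until the process dies.]

```shell
# Toy model of the backtrace: rda_readdirp and dht_readdirp_cbk call each
# other with the same offset ("$1" never advances), so the recursion never
# bottoms out. In the real client this overflows the stack and the kernel
# delivers SIGSEGV.
rda_readdirp()     { dht_readdirp_cbk "$1"; }
dht_readdirp_cbk() { rda_readdirp "$1"; }
# Guarded demo: bash's FUNCNEST caps function nesting, so this errors out
# instead of actually crashing the shell:
#   FUNCNEST=100 bash -c '...; rda_readdirp 2'
```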
It looks to be a corruption. Value of size argument in rda_readdirp is
too big (around 127 TB) to be sane. If you've a reproducer, please run it
in valgrind or ASAN.
I spoke too early. It could be a negative value and hence it may not be a
corruption. Is it possible to upload the core somewhere? Or better still
access to gdb session with this core would be more helpful.
To make it explicit: at the moment it's not clear whether the bug is in
readdir-ahead or DHT, as it looks to be a memory corruption. Till I get a
reproducer or valgrind/ASAN output of the client process when the issue
occurs, I won't be working on this problem.
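[Editorial note: one possible way to obtain that valgrind output, as a sketch only. The server name is a placeholder, the mount point is an example, and a substantial slowdown should be expected on a production mount.]

```shell
# Run the FUSE client in the foreground (-N) under valgrind so the first
# invalid access is reported together with its origin.
valgrind --leak-check=full --track-origins=yes \
         --log-file=/tmp/glusterfs-valgrind.%p.log \
         /usr/sbin/glusterfs -N --volfile-server=<server> \
         --volfile-id=atlasglust /mnt/atlasglust
```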
Post by Raghavendra Gowdappa
Post by mohammad kashif
Hi Milind
Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here:
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand but still not able to make sense out of it.
Thanks
Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 from
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/
repo. Do you know from where I can get it?
Also when I run gdb, it says
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find debug package for glusterfs-fuse either
Thanks from the pit of despair ;)
Kashif
On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps at client and there is no glusterd
process running on client.
Kashif
On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you ran
gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on
RHEL6. The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the
lines following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar
Also, let me know the CPU arch and OS Distribution on which you
are running gluster.
If you've installed the glusterfs-debuginfo package, you'll
also get the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced glusterd
on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As far as I understand, gluster is somehow trying to access memory in an
inappropriate manner and the kernel sends SIGSEGV.
I also got the core dump. I am trying gdb first time so I am
not sure whether I am using it correctly
gdb /usr/sbin/glusterfs core.138536
It just tells me that the program terminated with signal 11,
segmentation fault.
The problem is not limited to one client but is happening on many
clients.
I will really appreciate any help, as the whole file system has
become unusable.
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to DEBUG
instead of TRACE.
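[Editorial note: as a sketch with the volume name from this thread. TRACE logs grow very quickly, so the option should be restored once the crash has been captured; `gluster volume reset` puts it back to the default.]

```shell
gluster volume set atlasglust diagnostics.client-log-level TRACE
# ...reproduce the unmount, save the client log, then restore the default:
gluster volume reset atlasglust diagnostics.client-log-level
```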
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins!
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
has only these lines:
[2018-06-12 09:53:19.303102] I [MSGID: 115013]
fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
Shutting down connection <server-name>-2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
There is no other information. Is there any way to increase
log verbosity?
on the client
[2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I
[fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol
versions: glusterfs 7.24 kernel 7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync]
0-fuse: switched to graph 0
Is there a problem with the server and client lk-version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
Post by mohammad kashif
Hi
Since I have updated our gluster server and client to
latest version 3.12.9-1, I am having this issue of gluster getting
unmounted from client very regularly. It was not a problem before update.
Its a distributed file system with no replication. We have
seven servers totaling around 480TB data. Its 97% full.
I am using following config on server
gluster volume set atlasglust features.cache-invalidation on
gluster volume set atlasglust features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust performance.cache-invalidation on
gluster volume set atlasglust performance.md-cache-timeout 600
gluster volume set atlasglust performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4
clients are mounted with this option
defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
I can't see anything in the log file. Can someone suggest
that how to troubleshoot this issue?
Can you please share the log file? Checking for messages
related to disconnections/crashes in the log file would be a good way to
start troubleshooting the problem.
Thanks,
Vijay
--
Milind
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
mohammad kashif
2018-06-20 16:12:34 UTC
Permalink
Hi Nithya

Thanks for the bug report. This new crash happened only once and only at
one client in the last 6 days. I will let you know if it happened again or
more frequently.

Cheers

Kashif
Post by Nithya Balachandran
Hi Mohammad,
This is a different crash. How often does it happen?
We have managed to reproduce the first crash you reported and a bug has
been filed at [1].
We will work on a fix for this.
Regards,
Nithya
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1593199
Nithya Balachandran
2018-06-21 03:22:14 UTC
Permalink
Thank you. In the meantime, turning off parallel readdir should prevent the
first crash.
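[Editorial note: for anyone hitting the same crash, the workaround translates to the command below, using the volume name from this thread; clients pick the change up on remount.]

```shell
gluster volume set atlasglust performance.parallel-readdir off
```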
Post by mohammad kashif
Hi Nithya
Thanks for the bug report. This new crash happened only once and only at
one client in the last 6 days. I will let you know if it happened again or
more frequently.
Cheers
Kashif
Post by Nithya Balachandran
Hi Mohammad,
This is a different crash. How often does it happen?
We have managed to reproduce the first crash you reported and a bug has
been filed at [1].
We will work on a fix for this.
Regards,
Nithya
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1593199
Post by mohammad kashif
Hi
Problem appeared again after few days. This time, the client
is glusterfs-3.10.12-1.el6.x86_64 and performance.parallel-readdir is
off. The log level was set to ERROR and I got this log at the time of crash
[2018-06-14 08:45:43.551384] E [rpc-clnt.c:365:saved_frames_unwind]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x7fac2e66ce03]
(--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fac2e434867]
(--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fac2e43497e]
(--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x7fac2e434a45]
(--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x7fac2e434d68]
))))) 0-atlasglust-client-4: forced unwinding frame type(GlusterFS 3.3)
op(READDIRP(40)) called at 2018-06-14 08:45:43.483303 (xid=0x7553c7
Core dump was enabled on the client so it created a dump. It is here:
http://www-pnp.physics.ox.ac.uk/~mohammad/core.1002074
I produced a gdb trace using this command:
gdb /usr/sbin/glusterfs core.1002074 -ex bt -ex quit |& tee
backtrace.log_18_16_1
The backtrace is here:
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log_18_16_1
I haven't used gdb much, so let me know if you want me to run it in a
different manner.
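For a fuller picture than a single `bt`, gdb can dump every thread's backtrace non-interactively; a sketch along the lines of the command above (the output filename is illustrative):

```shell
# Dump backtraces for all threads, with local variables, without paging.
gdb /usr/sbin/glusterfs core.1002074 \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    -ex quit |& tee backtrace_all_threads.log
```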
Thanks
Kashif
On Mon, Jun 18, 2018 at 6:27 AM, Raghavendra Gowdappa <
On Mon, Jun 18, 2018 at 9:39 AM, Raghavendra Gowdappa <
On Mon, Jun 18, 2018 at 8:11 AM, Raghavendra Gowdappa <
Post by Raghavendra Gowdappa
#8 0x00007f6ef977e6de in rda_readdirp (frame=0x7f6eec862320,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=357, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#9 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#10 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862210,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#11 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#12 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862100,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#13 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#14 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ff0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#15 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#16 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ee0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#17 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#18 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861dd0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#19 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#20 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861cc0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#21 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#22 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861bb0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#23 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized
out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
It looks like an infinite recursion. Note that readdirp is wound to
the same subvol (the value of "this" is the same in all calls to rda_readdirp) at
the same offset (of value 2). This may be a bug in DHT (winding down
readdirp with a wrong offset) or in readdir-ahead (populating incorrect
offset values in the dentries it returns as the readdirp response).
It looks to be a corruption. The value of the size argument in rda_readdirp is
too big (around 127 TB) to be sane. If you have a reproducer, please run it
under valgrind or ASAN.
I spoke too early. It could be a negative value and hence it may not be
a corruption. Is it possible to upload the core somewhere? Or better still,
access to a gdb session with this core would be more helpful.
To make it explicit: at the moment it is not clear whether the bug is in
readdir-ahead or DHT, as it looks to be a memory corruption. Till I get a
reproducer or valgrind/ASAN output of the client process when the issue
occurs, I won't be working on this problem.
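For the valgrind run requested here, one approach is to start the fuse client by hand in the foreground; a rough sketch, with the server name and mount point as placeholders (not from the thread):

```shell
# Unmount first, then run the client under valgrind in no-daemon mode (-N).
umount /your/mountpoint
valgrind --leak-check=full --track-origins=yes \
    --log-file=/tmp/glusterfs-valgrind.%p.log \
    /usr/sbin/glusterfs -N \
    --volfile-server=<server> --volfile-id=atlasglust \
    /your/mountpoint
```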
Post by Raghavendra Gowdappa
On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here:
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand it but am still not able to make sense of it.
Thanks
Kashif
On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 in the
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/
repo. Do you know where I can get it?
Also, when I run gdb, it says:
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find a debug package for glusterfs-fuse either.
Thanks from the pit of despair ;)
Kashif
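One way to get the SIG debuginfo packages is a manual repo file pointing at the debuginfo.centos.org URL Milind gave; the repo id and filename below are made up for the example:

```shell
# Add a repo for the CentOS Storage SIG debuginfo packages, then install.
cat > /etc/yum.repos.d/centos-storage-debuginfo.repo <<'EOF'
[centos-storage-debuginfo]
name=CentOS-6 Storage SIG debuginfo
baseurl=http://debuginfo.centos.org/centos/6/storage/$basearch/
gpgcheck=0
enabled=1
EOF
yum install glusterfs-debuginfo
```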
On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind
I will send you links for logs.
I collected these core dumps on the client, and there is no glusterd
process running on the client.
Kashif
On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines
Also, you've mentioned that you straced glusterd, but when you
ran gdb, you ran it over /usr/sbin/glusterfs
On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6 which is based on
RHEL6. The cpu arch is Intel x86_64.
I will send you a separate email with link to core dump.
You could also grep for crash in the client log file and the
lines following crash would have a backtrace in most cases.
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something
similar
Also, let me know the CPU arch and OS Distribution on which
you are running gluster.
If you've installed the glusterfs-debuginfo package, you'll
also get the source lines in the backtrace via gdb
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now as I straced
glusterd on client
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As far as I understand, gluster is somehow trying to access
memory in an inappropriate manner and the kernel sends SIGSEGV.
I also got the core dump. I am trying gdb for the first time so I am
not sure whether I am using it correctly:
gdb /usr/sbin/glusterfs core.138536
It just tells me that the program terminated with signal 11,
segmentation fault.
The problem is not limited to one client but is happening on
many clients.
I will really appreciate any help, as the whole file system has
become unusable.
Thanks
Kashif
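Since the crash recurs on many clients, it may help to make sure every client writes a usable core; a sketch, where the core directory is a placeholder of my choosing:

```shell
# Allow unlimited-size core dumps and give them a predictable location.
ulimit -c unlimited
mkdir -p /var/crash
echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern
```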
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare
If you want fewer logs you can change the log-level to
DEBUG instead of TRACE.
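Once debugging is done, it is worth putting the log level back so TRACE output doesn't fill the disk; for example:

```shell
# Raise client verbosity while reproducing, then reset to the default.
gluster volume set atlasglust diagnostics.client-log-level TRACE
# ... reproduce the crash ...
gluster volume reset atlasglust diagnostics.client-log-level
```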
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 mins!
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
has only these lines:
[2018-06-12 09:53:19.303102] I [MSGID: 115013]
[server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on
/atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down
connection <server-name> -2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
There is no other information. Is there any way to
increase log verbosity?
on the client
2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I
[fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol
versions: glusterfs 7.24 kernel 7.14
[2018-06-12 09:51:01.752261] I
[fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to
graph 0
Is there a problem with the server and client lk-version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
Post by mohammad kashif
Hi
Since I have updated our gluster server and client to
latest version 3.12.9-1, I am having this issue of gluster getting
unmounted from client very regularly. It was not a problem before update.
It's a distributed file system with no replication. We
have seven servers totaling around 480TB of data. It's 97% full.
I am using following config on server
gluster volume set atlasglust
features.cache-invalidation on
gluster volume set atlasglust
features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch
on
gluster volume set atlasglust
performance.cache-invalidation on
gluster volume set atlasglust
performance.md-cache-timeout 600
gluster volume set atlasglust
performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust
performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch
on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4
clients are mounted with this option
defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
I can't see anything in the log file. Can someone
suggest that how to troubleshoot this issue?
Can you please share the log file? Checking for messages
related to disconnections/crashes in the log file would be a good way to
start troubleshooting the problem.
Thanks,
Vijay
mohammad kashif
2018-06-12 15:29:02 UTC
Permalink
Hi Vijay

I have enabled TRACE for the client and there are lots of trace messages in
the log, but no 'crash'.

The only error I can see is about the inode context being NULL:

[io-cache.c:564:ioc_open_cbk] 0-atlasglust-io-cache: inode context is NULL
(748157d2-274f-4595-9bb6-afb1fb5a0642) [Invalid argument]

Kashif