From the bt:
#8 0x00007f6ef977e6de in rda_readdirp (frame=0x7f6eec862320,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=357, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#9 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#10 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862210,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#11 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#12 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862100,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#13 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#14 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ff0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#15 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#16 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ee0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#17 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#18 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861dd0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#19 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#20 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861cc0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#21 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
#22 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861bb0,
this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#23 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
dht-common.c:5388
It looks like an infinite recursion. Note that readdirp is wound to the
same subvolume (the value of "this" is the same in every call to
rda_readdirp) at the same offset (2). This may be a bug in DHT (winding
readdirp down with the wrong offset) or in readdir-ahead (populating
incorrect offset values in the dentries it returns in the readdirp
response).
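One way to confirm this from the core in gdb (a sketch; the frame numbers are taken from the backtrace above, and the "name" field assumes the xlator_t layout in the 3.12 sources):
(gdb) frame 8
(gdb) print this->name
(gdb) print off
(gdb) frame 10
(gdb) print this->name
(gdb) print off
If both frames report the same xlator name and offset, readdirp really is being wound back into the same subvolume at the same offset.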
Post by mohammad kashif
Hi Milind
Thanks a lot, I managed to run gdb and produced a backtrace as well. It's here:
http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
I am trying to understand it but still can't make sense of it.
Thanks
Kashif
Post by Milind Changire
Kashif,
FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
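One possible way to pull the matching debuginfo from that location (the repo id and gpgcheck=0 here are assumptions; adjust as needed):
cat > /etc/yum.repos.d/centos-storage-debuginfo.repo <<'EOF'
[centos-storage-debuginfo]
name=CentOS Storage SIG debuginfo
baseurl=http://debuginfo.centos.org/centos/6/storage/x86_64/
enabled=1
gpgcheck=0
EOF
debuginfo-install glusterfs-fuse-3.12.9-1.el6.x86_64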
Post by mohammad kashif
Hi Milind
There is no glusterfs-debuginfo available for gluster-3.12 in the
http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo. Do
you know where I can get it?
Also, when I run gdb, it says:
Missing separate debuginfos, use: debuginfo-install
glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find a debuginfo package for glusterfs-fuse either.
Thanks from the pit of despair ;)
Kashif
Post by mohammad kashif
Hi Milind
I will send you links to the logs.
I collected these core dumps on the client, and there is no glusterd process
running on the client.
Kashif
Post by Milind Changire
Kashif,
Could you also send over the client/mount log file as Vijay suggested?
Or at least the lines around the crash backtrace.
Also, you've mentioned that you straced glusterd, but when you ran
gdb, you ran it over /usr/sbin/glusterfs.
On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
Post by mohammad kashif
Hi Milind
The operating system is Scientific Linux 6, which is based on RHEL 6.
The CPU arch is Intel x86_64.
I will send you a separate email with a link to the core dump.
You could also grep for "crash" in the client log file; the lines
following the crash message will usually contain a backtrace.
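For example (a sketch; the client log file name depends on the mount point, so adjust the path):
grep -A 30 "crash" /var/log/glusterfs/<mount-point>.log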
HTH,
Vijay
Post by mohammad kashif
Thanks for your help.
Kashif
On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
Post by Milind Changire
Kashif,
Could you share the core dump via Google Drive or something similar?
Also, let me know the CPU arch and OS distribution on which you are
running gluster.
If you've installed the glusterfs-debuginfo package, you'll also
get the source lines in the backtrace via gdb.
On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
Post by mohammad kashif
Hi Milind, Vijay
Thanks, I have some more information now, as I straced glusterd on the client:
138544 0.000131 mprotect(0x7f2f70785000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096,
PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV,
si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++
As far as I understand, gluster is somehow trying to access
memory in an inappropriate manner and the kernel sends SIGSEGV.
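For reference, one way to capture such a trace against the client process itself (the PID and output path are placeholders):
strace -f -tt -T -o /tmp/glusterfs.strace -p <pid of /usr/sbin/glusterfs>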
I also got the core dump. I am trying gdb for the first time, so I am not
sure whether I am using it correctly:
gdb /usr/sbin/glusterfs core.138536
It just tells me that the program terminated with signal 11,
segmentation fault.
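From that gdb prompt, these standard gdb commands dump the full backtrace of every thread, which is the most useful thing to share:
(gdb) set pagination off
(gdb) thread apply all bt full
(gdb) quit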
The problem is not limited to one client but is happening on many clients.
I would really appreciate any help, as the whole file system has become unusable.
Thanks
Kashif
On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
Post by Milind Changire
Kashif,
To increase the log verbosity, you can set
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare.
If you want fewer logs, you can change the log level to DEBUG
instead of TRACE.
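Once the problem has been captured, the same options can be turned back down so the logs stay manageable (INFO is the usual default):
$ gluster volume set <vol> diagnostics.brick-log-level INFO
$ gluster volume set <vol> diagnostics.client-log-level INFO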
On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
Post by mohammad kashif
Hi Vijay
Now it is unmounting every 30 minutes!
The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
has only these lines:
[2018-06-12 09:53:19.303102] I [MSGID: 115013]
[server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd
cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting
down connection <server-name> -2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
There is no other information. Is there any way to increase log
verbosity?
On the client:
[2018-06-12 09:51:01.744980] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057]
[client-handshake.c:1478:select_server_supported_programs]
0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046]
[client-handshake.c:1231:client_setvolume_cbk]
0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047]
[client-handshake.c:1242:client_setvolume_cbk]
0-atlasglust-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035]
[client-handshake.c:202:client_set_lk_version_cbk]
0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync]
0-fuse: switched to graph 0
Is there a problem with the server and client lk-version?
Thanks for your help.
Kashif
On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
Post by mohammad kashif
Hi
Since I updated our gluster servers and clients to the latest
version, 3.12.9-1, I have been having this issue of gluster getting unmounted from
the client very regularly. It was not a problem before the update.
It's a distributed file system with no replication. We have
seven servers totaling around 480TB of data. It's 97% full.
I am using the following config on the server:
gluster volume set atlasglust features.cache-invalidation on
gluster volume set atlasglust features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust performance.cache-invalidation on
gluster volume set atlasglust performance.md-cache-timeout 600
gluster volume set atlasglust performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4
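As a quick sanity check, the applied values can be read back afterwards, assuming your CLI supports "volume get", e.g.:
gluster volume get atlasglust performance.parallel-readdir
gluster volume get atlasglust performance.cache-size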
Clients are mounted with these options:
defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
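For reference, the corresponding fstab entry would look roughly like this (the server hostname and mount point are placeholders):
gluster-server1:/atlasglust /mnt/atlasglust glusterfs defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev 0 0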
I can't see anything in the log file. Can someone suggest
how to troubleshoot this issue?
Can you please share the log file? Checking for messages
related to disconnections/crashes in the log file would be a good way to
start troubleshooting the problem.
Thanks,
Vijay
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users