Discussion:
[Gluster-users] KVM lockups on Gluster 4.1.1
Walter Deignan
2018-08-14 21:24:32 UTC
I am using gluster to host KVM/QEMU images. I am seeing an intermittent
issue where access to an image will hang. I have to do a lazy dismount of
the gluster volume in order to break the lock and then reset the impacted
virtual machine.
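
(For reference, my recovery procedure is roughly the following - /mnt/gv1
and vm01 are placeholder names for the mount point and the affected guest:)

  umount -l /mnt/gv1                           # lazy dismount; breaks the hung lock
  mount -t glusterfs dc-vihi44:/gv1 /mnt/gv1   # remount the volume
  virsh reset vm01                             # hard-reset the stuck VM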

It happened again today and I caught the events below in the client-side
logs. Any thoughts on what might cause this? It seemed to begin after I
upgraded from 3.12.10 to 4.1.1 a few weeks ago.

[2018-08-14 14:22:15.549501] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote
operation failed [Invalid argument]
[2018-08-14 14:22:15.549576] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote
operation failed [Invalid argument]
[2018-08-14 14:22:15.549583] E [MSGID: 108010]
[afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2:
path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on
subvolume gv1-client-4 with lock owner d89caca92b7f0000 [Invalid argument]
[2018-08-14 14:22:15.549615] E [MSGID: 108010]
[afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2:
path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on
subvolume gv1-client-5 with lock owner d89caca92b7f0000 [Invalid argument]
[2018-08-14 14:52:18.726219] E [rpc-clnt.c:184:call_bail] 2-gv1-client-4:
bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc5e00
sent = 2018-08-14 14:22:15.699082. timeout = 1800 for 10.35.20.106:49159
[2018-08-14 14:52:18.726254] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote
operation failed [Transport endpoint is not connected]
[2018-08-14 15:22:25.962546] E [rpc-clnt.c:184:call_bail] 2-gv1-client-5:
bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc4a6d
sent = 2018-08-14 14:52:18.726329. timeout = 1800 for 10.35.20.107:49164
[2018-08-14 15:22:25.962587] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote
operation failed [Transport endpoint is not connected]
[2018-08-14 15:22:25.962618] W [MSGID: 108019]
[afr-lk-common.c:601:is_blocking_locks_count_sufficient]
2-gv1-replicate-2: Unable to obtain blocking inode lock on even one child
for gfid:24a48cae-53fe-4634-8fb7-0254c85ad672.
[2018-08-14 15:22:25.962668] W [fuse-bridge.c:1441:fuse_err_cbk]
0-glusterfs-fuse: 3715808: FSYNC() ERR => -1 (Transport endpoint is not
connected)
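
(Side note: the "timeout = 1800" in the call_bail lines above is the client
RPC frame timeout - the network.frame-timeout volume option, which defaults
to 1800 seconds - so each bail is a FINODELK request that sat unanswered for
30 minutes. The timeout is tunable, though lowering it only makes the client
give up sooner and does not address the failed unlock itself:)

  gluster volume set gv1 network.frame-timeout 600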

Volume configuration -

Volume Name: gv1
Type: Distributed-Replicate
Volume ID: 66ad703e-3bae-4e79-a0b7-29ea38e8fcfc
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: dc-vihi44:/gluster/bricks/megabrick/data
Brick2: dc-vihi45:/gluster/bricks/megabrick/data
Brick3: dc-vihi44:/gluster/bricks/brick1/data
Brick4: dc-vihi45:/gluster/bricks/brick1/data
Brick5: dc-vihi44:/gluster/bricks/brick2_1/data
Brick6: dc-vihi45:/gluster/bricks/brick2/data
Brick7: dc-vihi44:/gluster/bricks/brick3/data
Brick8: dc-vihi45:/gluster/bricks/brick3/data
Brick9: dc-vihi44:/gluster/bricks/brick4/data
Brick10: dc-vihi45:/gluster/bricks/brick4/data
Options Reconfigured:
cluster.min-free-inodes: 6%
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
cluster.eager-lock: enable
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
user.cifs: off
cluster.choose-local: off
features.shard: on
cluster.server-quorum-ratio: 51%

-Walter Deignan
-Uline IT, Systems Architect
Claus Jeppesen
2018-08-20 07:38:19 UTC
I think I have seen this also on our CentOS 7.5 systems using GlusterFS
4.1.1 (*) - has an upgrade to 4.1.2 helped out? I'm trying this now.

Thanx,

Claus.

(*) libvirt/qemu log:
[2018-08-19 16:45:54.275830] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
0-glu-vol01-lab-client-0: remote operation failed [Invalid argument]
[2018-08-19 16:45:54.276156] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
0-glu-vol01-lab-client-1: remote operation failed [Invalid argument]
[2018-08-19 16:45:54.276159] E [MSGID: 108010]
[afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0:
path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on
subvolume glu-vol01-lab-client-0 with lock owner 28ae497049560000 [Invalid
argument]
[2018-08-19 16:45:54.276183] E [MSGID: 108010]
[afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0:
path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on
subvolume glu-vol01-lab-client-1 with lock owner 28ae497049560000 [Invalid
argument]
[2018-08-19 17:16:03.690808] E [rpc-clnt.c:184:call_bail]
0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1)
op(FINODELK(30)) xid = 0x3071a5 sent = 2018-08-19 16:45:54.276560. timeout
= 1800 for
192.168.13.131:49152
[2018-08-19 17:16:03.691113] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is
not connected]
[2018-08-19 17:46:03.855909] E [rpc-clnt.c:184:call_bail]
0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1)
op(FINODELK(30)) xid = 0x301d0f sent = 2018-08-19 17:16:03.691174. timeout
= 1800 for
192.168.13.132:49152
[2018-08-19 17:46:03.856170] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is
not connected]
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
... many repeats ...
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
[2018-08-19 18:16:04.022526] E [rpc-clnt.c:184:call_bail]
0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1)
op(FINODELK(30)) xid = 0x307221 sent = 2018-08-19 17:46:03.861005. timeout
= 1800 for
192.168.13.131:49152
[2018-08-19 18:16:04.022788] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is
not connected]
[2018-08-19 18:46:04.195590] E [rpc-clnt.c:184:call_bail]
0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1)
op(FINODELK(30)) xid = 0x301d8a sent = 2018-08-19 18:16:04.022838. timeout
= 1800 for
192.168.13.132:49152
[2018-08-19 18:46:04.195881] E [MSGID: 114031]
[client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is
not connected]
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
block I/O error in device 'drive-virtio-disk0': Operation not permitted (1)
qemu: terminating on signal 15 from pid 507
2018-08-19 19:36:59.065+0000: shutting down, reason=destroyed
2018-08-19 19:37:08.059+0000: starting up libvirt version: 3.9.0, package:
14.el7_5.6 (CentOS BuildSystem <http://bugs.centos.org>,
2018-06-27-14:13:57, x86-01.bsys.centos.org), qemu version: 1.5.3
(qemu-kvm-1.5.3-156.el7_5.3)

At 19:37 the VM was restarted.
Post by Walter Deignan
[...]
--
Claus Jeppesen
Manager, Network Services
Datto, Inc.
p +45 6170 5901 | Copenhagen Office
www.datto.com
Amar Tumballi
2018-08-20 08:50:29 UTC
Thanks for this report! We will look into this. This is something new we
are seeing, and we are not aware of an RCA yet!

-Amar
Post by Claus Jeppesen
[...]
--
Amar Tumballi (amarts)
Walter Deignan
2018-08-20 12:50:33 UTC
I upgraded late last week to 4.1.2. Since then I've seen several POSIX
health checks fail and bricks drop offline, but I'm not sure if that's
related or a different root issue.

I haven't seen the issue described below re-occur on 4.1.2 yet, but it was
intermittent to begin with, so I'll probably need to run for a week or more
to be confident.
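
(To check whether the bricks are really down after those health-check
events, I look at brick process status and at the health-check messages in
the brick logs - volume name gv1 as before:)

  gluster volume status gv1
  grep -i "health-check" /var/log/glusterfs/bricks/*.log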

-Walter Deignan
-Uline IT, Systems Architect



From: "Claus Jeppesen" <***@datto.com>
To: ***@uline.com
Cc: gluster-***@gluster.org
Date: 08/20/2018 07:20 AM
Subject: Re: [Gluster-users] KVM lockups on Gluster 4.1.1



[...]
Amar Tumballi
2018-08-20 15:38:30 UTC
Post by Walter Deignan
I upgraded late last week to 4.1.2. Since then I've seen several posix
health checks fail and bricks drop offline but I'm not sure if that's
related or a different root issue.
I haven't seen the issue described below re-occur on 4.1.2 yet but it was
intermittent to begin with so I'll probably need to run for a week or more
to be confident.
Thanks for the update! We will be trying to reproduce the issue, and also
to root-cause it based on code analysis, but if you can get us the brick
logs from around the time this happens, it may fast-track the fix.
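
(In case it helps: on each server the brick logs are under
/var/log/glusterfs/bricks/, one file per brick, named after the brick
path - e.g. for /gluster/bricks/brick1/data something like:)

  grep -E ' [EW] \[' /var/log/glusterfs/bricks/gluster-bricks-brick1-data.log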

Thanks again,
Amar
Post by Walter Deignan
[...]
--
Amar Tumballi (amarts)
Claus Jeppesen
2018-08-21 06:50:32 UTC
Hi Amar,

Unfortunately I do not have the GlusterFS brick logs anymore - however I do
have a hint:
I have 2 gluster (4.1.1) glusterfs volumes where I saw the issue - each has
about 10-12 VMs active.
I also have 2 addl. gluster (4.1.1) glusterfs volumes, but with only 3-4
VMs, where I did not see the
issue (and they had been running for 1-2 months).

Thanx,

Claus.

P.S. We are talking about using Gluster "URI" with qemu - I hope - e.g. like

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source protocol='gluster' name='glu-vol03-lab/install3'>
    <host name='install2.vlan13' port='24007'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
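
(One can sanity-check such a URI from the KVM host with qemu-img - assuming
qemu was built with gluster support:)

  qemu-img info gluster://install2.vlan13:24007/glu-vol03-lab/install3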
Post by Amar Tumballi
[...]
--
Claus Jeppesen
Manager, Network Services
Datto, Inc.
p +45 6170 5901 | Copenhagen Office
www.datto.com
Amar Tumballi
2018-08-20 09:30:23 UTC
Post by Walter Deignan
[...]
bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc5e00
sent = 2018-08-14 14:22:15.699082. timeout = 1800 for 10.35.20.106:49159
[2018-08-14 14:52:18.726254] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
2-gv1-client-4: remote operation failed [Transport endpoint is not
connected]
bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc4a6d
sent = 2018-08-14 14:52:18.726329. timeout = 1800 for 10.35.20.107:49164
Hi Walter,

Do you see any warnings or errors in the brick logs around this time?

Regards,
Amar
Post by Walter Deignan
[...]
--
Amar Tumballi (amarts)