Discussion:
[Gluster-users] Sharding problem - multiple shard copies with mismatching gfids
Ian Halliday
2018-03-25 19:39:28 UTC
Hello all,

We are having a rather interesting problem with one of our VM storage
systems. The GlusterFS client is throwing errors relating to GFID
mismatches. We traced this down to multiple copies of the same shard being
present on the gluster nodes, each with a different gfid.

Hypervisor gluster mount log:

[2018-03-25 18:54:19.261733] E [MSGID: 133010]
[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard:
Lookup on shard 7 failed. Base file gfid =
87137cac-49eb-492a-8f33-8e33470d8cb7 [Stale file handle]
The message "W [MSGID: 109009]
[dht-common.c:2162:dht_lookup_linkfile_cbk] 0-ovirt-zone1-dht:
/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid different on data
file on ovirt-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node =
57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between
[2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
[2018-03-25 18:54:19.264349] W [MSGID: 109009]
[dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht:
/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on
subvolume ovirt-zone1-replicate-3, gfid local =
fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node =
57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56


On the storage nodes, we found this:

[***@n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7

[***@n1 gluster]# ls -lh
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
---------T. 2 root root 0 Mar 25 13:55
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
[***@n1 gluster]# ls -lh
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
-rw-rw----. 2 root root 3.8G Mar 25 13:55
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7

[***@n1 gluster]# getfattr -d -m . -e hex
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
# file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300

[***@n1 gluster]# getfattr -d -m . -e hex
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
# file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x020000000000000059914190000ce672
trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56
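
(As an aside, the trusted.glusterfs.dht.linkto value is just a hex-encoded,
NUL-terminated subvolume name; it can be decoded with something like the
following, assuming xxd is available:)

echo 6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300 | xxd -r -p
# prints: ovirt-350-zone1-replicate-3   (the trailing 00 is a NUL terminator)
# i.e. the linkto on brick2 points at the subvolume holding the data copy on brick4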


I'm wondering how they got created in the first place. Does anyone have any
insight on how to fix it?

Storage nodes:
[***@n1 gluster]# gluster --version
glusterfs 4.0.0

[***@n1 gluster]# gluster volume info

Volume Name: ovirt-350-zone1
Type: Distributed-Replicate
Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
Status: Started
Snapshot Count: 0
Number of Bricks: 7 x (2 + 1) = 21
Transport-type: tcp
Bricks:
Brick1: 10.0.6.100:/gluster/brick1/brick
Brick2: 10.0.6.101:/gluster/brick1/brick
Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
Brick4: 10.0.6.100:/gluster/brick2/brick
Brick5: 10.0.6.101:/gluster/brick2/brick
Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
Brick7: 10.0.6.100:/gluster/brick3/brick
Brick8: 10.0.6.101:/gluster/brick3/brick
Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
Brick10: 10.0.6.100:/gluster/brick4/brick
Brick11: 10.0.6.101:/gluster/brick4/brick
Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
Brick13: 10.0.6.100:/gluster/brick5/brick
Brick14: 10.0.6.101:/gluster/brick5/brick
Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
Brick16: 10.0.6.100:/gluster/brick6/brick
Brick17: 10.0.6.101:/gluster/brick6/brick
Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
Brick19: 10.0.6.100:/gluster/brick7/brick
Brick20: 10.0.6.101:/gluster/brick7/brick
Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
Options Reconfigured:
cluster.min-free-disk: 50GB
performance.strict-write-ordering: off
performance.strict-o-direct: off
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
performance.cache-size: 1GB
features.shard: on
features.shard-block-size: 5GB
server.event-threads: 8
server.outstanding-rpc-limit: 128
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.flush-behind: off
performance.write-behind-window-size: 8MB
client.event-threads: 8
server.allow-insecure: on


Client version:
[***@kvm573 ~]# gluster --version
glusterfs 3.12.5


Thanks!

- Ian
Krutika Dhananjay
2018-03-26 07:10:54 UTC
The gfid mismatch here is between the shard and its "link-to" file, the
creation of which happens at a layer below that of shard translator on the
stack.

Adding DHT devs to take a look.

-Krutika
Raghavendra Gowdappa
2018-03-26 07:25:59 UTC
Post by Krutika Dhananjay
The gfid mismatch here is between the shard and its "link-to" file, the
creation of which happens at a layer below that of shard translator on the
stack.
Adding DHT devs to take a look.
Thanks Krutika. I assume shard doesn't do any dentry operations like
rename, link or unlink on the path of the file (not the gfid-handle-based
path) internally while managing shards. Can you confirm? If it does such
operations, which fops does it do?

@Ian,

I can suggest the following way to fix the problem:
1. Since one of the files listed is a DHT linkto file, I am assuming there is
only one shard of the file. If not, please list out the gfids of the other
shards and don't proceed with the healing procedure.
2. If the gfids of all shards happen to be the same and only the linkto file
has a different gfid, please proceed to step 3. Otherwise abort the healing
procedure.
3. If cluster.lookup-optimize is set to true, abort the healing procedure.
4. Delete the linkto file - the file with permissions ---------T and the xattr
trusted.glusterfs.dht.linkto - and do a lookup on the file from the mount
point after turning off readdirplus [1]. (A rough command-level sketch of
these steps follows below.)
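
For illustration, the steps above might look roughly like this on the command
line (brick paths, volume name and shard name are taken from this thread; the
use-readdirp mount option, mount point and image path are assumptions to be
adapted to your setup):

# (1)/(2) compare the gfids of every on-brick copy of the base file's shards
# (run on each storage node)
for b in /gluster/brick*/brick /gluster/arbrick*/brick; do
    getfattr -n trusted.gfid -e hex $b/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.* 2>/dev/null
done

# (3) abort if lookup-optimize is enabled
gluster volume get ovirt-350-zone1 cluster.lookup-optimize

# (4) remove the linkto copy (the zero-byte ---------T file carrying
#     trusted.glusterfs.dht.linkto) ...
rm -f /gluster/brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
# ... then trigger a fresh lookup from a client mount with readdirplus disabled,
# e.g. by reading the base image at that shard's offset (shard 7, 5GB shard-block-size)
mount -t glusterfs -o use-readdirp=no 10.0.6.100:/ovirt-350-zone1 /mnt/heal
dd if=/mnt/heal/PATH/TO/VM/IMAGE of=/dev/null bs=1M count=1 skip=$((7 * 5 * 1024))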

As to how we ended up in this situation, can you explain what the I/O
pattern on this file is - for example, are there lots of entry operations
like rename, link, unlink etc. on the file? There have been known races in
rename/lookup-heal-creating-linkto where the linkto and data file end up
with different gfids. [2] fixes some of these cases.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
[2] https://review.gluster.org/#/c/19547/

regards,
Raghavendra
Raghavendra Gowdappa
2018-03-26 07:37:21 UTC
Ian,

Do you have a reproducer for this bug? If not a specific one, a general
outline of what operations were done on the file will help.

regards,
Raghavendra
Ian Halliday
2018-03-26 08:39:52 UTC
Raghavendra,

The issue typically appears during heavy write operations to the VM
image. It's most noticeable during the filesystem creation process on a
virtual machine image. I'll get some specific data while executing that
process and will get back to you soon.
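
For context, the in-guest sequence is roughly the usual LVM/XFS provisioning,
e.g. something like (device name is illustrative):

parted --script /dev/vda mklabel gpt mkpart primary 1MiB 100%
pvcreate /dev/vda1
vgcreate data /dev/vda1
lvcreate -l 100%FREE -n vol data
mkfs.xfs /dev/data/vol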

thanks


-- Ian

Ian Halliday
2018-04-03 02:22:20 UTC
Raghavendra,

Sorry for the late follow up. I have some more data on the issue.

The issue tends to happen when the shards are created. The easiest time
to reproduce this is during an initial VM disk format. This is a log
from a test VM that was launched, and then partitioned and formatted
with LVM / XFS:

[2018-04-03 02:05:00.838440] W [MSGID: 109048]
[dht-common.c:9732:dht_rmdir_cached_lookup_cbk] 0-ovirt-350-zone1-dht:
/489c6fb7-fe61-4407-8160-35c0aac40c85/images/_remove_me_9a0660e1-bd86-47ea-8e09-865c14f11f26/e2645bd1-a7f3-4cbd-9036-3d3cbc7204cd.meta
found on cached subvol ovirt-350-zone1-replicate-5
[2018-04-03 02:07:57.967489] I [MSGID: 109070]
[dht-common.c:2796:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht:
Lookup of /.shard/927c6620-848b-4064-8c88-68a332b645c2.7 on
ovirt-350-zone1-replicate-3 (following linkfile) failed ,gfid =
00000000-0000-0000-0000-000000000000 [No such file or directory]
[2018-04-03 02:07:57.974815] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
[2018-04-03 02:07:57.979851] W [MSGID: 109009]
[dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht:
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980716] W [MSGID: 109009]
[dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht:
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
b1e3f299-32ff-497e-918b-090e957090f6, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980763] E [MSGID: 133010]
[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard:
Lookup on shard 3 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.983016] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
[2018-04-03 02:07:57.988761] W [MSGID: 109009]
[dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht:
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
b1e3f299-32ff-497e-918b-090e957090f6, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.988844] W [MSGID: 109009]
[dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht:
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node =
955a5e78-ab4c-499a-89f8-511e041167fb
[2018-04-03 02:07:57.989748] W [MSGID: 109009]
[dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht:
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
efbb9be5-0744-4883-8f3e-e8f7ce8d7741, gfid node =
955a5e78-ab4c-499a-89f8-511e041167fb
[2018-04-03 02:07:57.989827] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret -1 and op_errno 2 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
[2018-04-03 02:07:57.989832] E [MSGID: 133010]
[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard:
Lookup on shard 7 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
The message "W [MSGID: 109009]
[dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht:
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb " repeated 2 times between
[2018-04-03 02:07:57.979851] and [2018-04-03 02:07:57.995739]
[2018-04-03 02:07:57.996644] W [MSGID: 109009]
[dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht:
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
0a701104-e9a2-44c0-8181-4a9a6edecf9f, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.996761] E [MSGID: 133010]
[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard:
Lookup on shard 3 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.998986] W [MSGID: 109009]
[dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht:
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.999857] W [MSGID: 109009]
[dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht:
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
0a701104-e9a2-44c0-8181-4a9a6edecf9f, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.999899] E [MSGID: 133010]
[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard:
Lookup on shard 3 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.999942] W [fuse-bridge.c:896:fuse_attr_cbk]
0-glusterfs-fuse: 22338: FSTAT()
/489c6fb7-fe61-4407-8160-35c0aac40c85/images/a717e25c-f108-4367-9d28-9235bd432bb7/5a8e541e-8883-4dec-8afd-aa29f38ef502
=> -1 (Stale file handle)
[2018-04-03 02:07:57.987941] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3


Duplicate shards are created. Output from one of the gluster nodes:

# find -name 927c6620-848b-4064-8c88-68a332b645c2.*
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7

[***@n1 gluster]# getfattr -d -m . -e hex
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
# file: brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f39323763363632302d383438622d343036342d386338382d3638613333326236343563322e3139
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300

[***@n1 gluster]# getfattr -d -m . -e hex
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
# file: brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f39323763363632302d383438622d343036342d386338382d3638613333326236343563322e3139


In the above example, the shard on Brick 1 is the bad one.
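
A quick way to see which copies are the stale linkto files, run on each
storage node (a sketch; brick paths are from this volume):

for f in $(find /gluster/brick*/brick/.shard /gluster/arbrick*/brick/.shard \
           -name '927c6620-848b-4064-8c88-68a332b645c2.*' 2>/dev/null); do
    echo "== $f"
    stat -c '%h links, %s bytes, mode %A' "$f"
    # a copy carrying the dht.linkto xattr is a link-to file, not real data
    getfattr -n trusted.glusterfs.dht.linkto -e hex "$f" 2>/dev/null && echo "   (linkto copy)"
done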

At this point, the VM will pause with an unknown storage error and will
not boot until the offending shards are removed.


# gluster volume info
Volume Name: ovirt-350-zone1
Type: Distributed-Replicate
Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
Status: Started
Snapshot Count: 0
Number of Bricks: 7 x (2 + 1) = 21
Transport-type: tcp
Bricks:
Brick1: 10.0.6.100:/gluster/brick1/brick
Brick2: 10.0.6.101:/gluster/brick1/brick
Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
Brick4: 10.0.6.100:/gluster/brick2/brick
Brick5: 10.0.6.101:/gluster/brick2/brick
Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
Brick7: 10.0.6.100:/gluster/brick3/brick
Brick8: 10.0.6.101:/gluster/brick3/brick
Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
Brick10: 10.0.6.100:/gluster/brick4/brick
Brick11: 10.0.6.101:/gluster/brick4/brick
Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
Brick13: 10.0.6.100:/gluster/brick5/brick
Brick14: 10.0.6.101:/gluster/brick5/brick
Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
Brick16: 10.0.6.100:/gluster/brick6/brick
Brick17: 10.0.6.101:/gluster/brick6/brick
Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
Brick19: 10.0.6.100:/gluster/brick7/brick
Brick20: 10.0.6.101:/gluster/brick7/brick
Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
Options Reconfigured:
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.client-io-threads: off
server.allow-insecure: on
client.event-threads: 8
storage.owner-gid: 36
storage.owner-uid: 36
server.event-threads: 16
features.shard-block-size: 5GB
features.shard: on
transport.address-family: inet
nfs.disable: yes

Any suggestions?


-- Ian


Raghavendra Gowdappa
2018-04-06 03:39:47 UTC
Sorry for the delay, Ian :).

This looks to be a genuine issue which requires some effort to fix. Can you
file a bug? I need the following information attached to the bug:

* Client and brick logs. If you can reproduce the issue, please set
diagnostics.client-log-level and diagnostics.brick-log-level to TRACE. If
you cannot reproduce the issue, or cannot accommodate such big logs, please
set the log levels to DEBUG.
* If possible, a simple reproducer. A simple script or list of steps is
appreciated.
* strace of the VM (to find out the I/O pattern). If possible, a dump of the
traffic between the kernel and glusterfs. This can be captured by mounting
glusterfs with the --dump-fuse option.

Note that the logs you've posted here capture the scenario _after_ the
shard file has gone into a bad state, but I need information on what led to
that situation. So, please start collecting this diagnostic information as
early as you can.
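
For example, something along these lines (volume name is from this thread;
the qemu pid, dump path and mount point are placeholders):

# raise log levels on the volume while reproducing
gluster volume set ovirt-350-zone1 diagnostics.client-log-level TRACE
gluster volume set ovirt-350-zone1 diagnostics.brick-log-level TRACE

# strace the qemu process backing the test VM to capture its I/O pattern
strace -ff -tt -T -o /var/tmp/vm-strace -p <qemu-pid>

# mount a client with fuse traffic dumping enabled
glusterfs --volfile-server=10.0.6.100 --volfile-id=ovirt-350-zone1 \
    --dump-fuse=/var/tmp/fuse-traffic.dump /mnt/ovirt-350-zone1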

regards,
Raghavendra
Post by Ian Halliday
Raghavendra,
Sorry for the late follow up. I have some more data on the issue.
The issue tends to happen when the shards are created. The easiest time to
reproduce this is during an initial VM disk format. This is a log from a
test VM that was launched, and then partitioned and formatted with LVM /
[2018-04-03 02:05:00.838440] W [MSGID: 109048]
/489c6fb7-fe61-4407-8160-35c0aac40c85/images/_remove_
me_9a0660e1-bd86-47ea-8e09-865c14f11f26/e2645bd1-a7f3-4cbd-9036-3d3cbc7204cd.meta
found on cached subvol ovirt-350-zone1-replicate-5
[2018-04-03 02:07:57.967489] I [MSGID: 109070]
[dht-common.c:2796:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht: Lookup
of /.shard/927c6620-848b-4064-8c88-68a332b645c2.7 on
ovirt-350-zone1-replicate-3 (following linkfile) failed ,gfid =
00000000-0000-0000-0000-000000000000 [No such file or directory]
[2018-04-03 02:07:57.974815] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
[2018-04-03 02:07:57.979851] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local = 00000000-0000-0000-0000-000000000000,
gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980716] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on subvolume
ovirt-350-zone1-replicate-3, gfid local = b1e3f299-32ff-497e-918b-090e957090f6,
gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980763] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk]
0-ovirt-350-zone1-shard: Lookup on shard 3 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.983016] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
[2018-04-03 02:07:57.988761] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on subvolume
ovirt-350-zone1-replicate-3, gfid local = b1e3f299-32ff-497e-918b-090e957090f6,
gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.988844] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local = 00000000-0000-0000-0000-000000000000,
gfid node = 955a5e78-ab4c-499a-89f8-511e041167fb
[2018-04-03 02:07:57.989748] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7: gfid differs on subvolume
ovirt-350-zone1-replicate-3, gfid local = efbb9be5-0744-4883-8f3e-e8f7ce8d7741,
gfid node = 955a5e78-ab4c-499a-89f8-511e041167fb
[2018-04-03 02:07:57.989827] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret -1 and op_errno 2 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
[2018-04-03 02:07:57.989832] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk]
0-ovirt-350-zone1-shard: Lookup on shard 7 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
The message "W [MSGID: 109009] [dht-common.c:2831:dht_lookup_linkfile_cbk]
gfid different on data file on ovirt-350-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
" repeated 2 times between [2018-04-03 02:07:57.979851] and [2018-04-03
02:07:57.995739]
[2018-04-03 02:07:57.996644] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on subvolume
ovirt-350-zone1-replicate-3, gfid local = 0a701104-e9a2-44c0-8181-4a9a6edecf9f,
gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.996761] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk]
0-ovirt-350-zone1-shard: Lookup on shard 3 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.998986] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local = 00000000-0000-0000-0000-000000000000,
gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.999857] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on subvolume
ovirt-350-zone1-replicate-3, gfid local = 0a701104-e9a2-44c0-8181-4a9a6edecf9f,
gfid node = 55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.999899] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk]
0-ovirt-350-zone1-shard: Lookup on shard 3 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.999942] W [fuse-bridge.c:896:fuse_attr_cbk]
0-glusterfs-fuse: 22338: FSTAT() /489c6fb7-fe61-4407-8160-
35c0aac40c85/images/a717e25c-f108-4367-9d28-9235bd432bb7/
5a8e541e-8883-4dec-8afd-aa29f38ef502 => -1 (Stale file handle)
[2018-04-03 02:07:57.987941] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
# find -name 927c6620-848b-4064-8c88-68a332b645c2.*
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
927c6620-848b-4064-8c88-68a332b645c2.19
# file: brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563
745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d
346336642d393737642d3761393337616138343830362f39323763363632
302d383438622d343036342d386338382d3638613333326236343563322e3139
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65
312d7265706c69636174652d3300
927c6620-848b-4064-8c88-68a332b645c2.19
# file: brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563
745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d
346336642d393737642d3761393337616138343830362f39323763363632
302d383438622d343036342d386338382d3638613333326236343563322e3139
In the above example, the shard on Brick 1 is the bad one.
At this point, the VM will pause with an unknown storage error and will
not boot until the offending shards are removed.
# gluster volume info
Volume Name: ovirt-350-zone1
Type: Distributed-Replicate
Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
Status: Started
Snapshot Count: 0
Number of Bricks: 7 x (2 + 1) = 21
Transport-type: tcp
Brick1: 10.0.6.100:/gluster/brick1/brick
Brick2: 10.0.6.101:/gluster/brick1/brick
Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
Brick4: 10.0.6.100:/gluster/brick2/brick
Brick5: 10.0.6.101:/gluster/brick2/brick
Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
Brick7: 10.0.6.100:/gluster/brick3/brick
Brick8: 10.0.6.101:/gluster/brick3/brick
Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
Brick10: 10.0.6.100:/gluster/brick4/brick
Brick11: 10.0.6.101:/gluster/brick4/brick
Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
Brick13: 10.0.6.100:/gluster/brick5/brick
Brick14: 10.0.6.101:/gluster/brick5/brick
Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
Brick16: 10.0.6.100:/gluster/brick6/brick
Brick17: 10.0.6.101:/gluster/brick6/brick
Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
Brick19: 10.0.6.100:/gluster/brick7/brick
Brick20: 10.0.6.101:/gluster/brick7/brick
Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.client-io-threads: off
server.allow-insecure: on
client.event-threads: 8
storage.owner-gid: 36
storage.owner-uid: 36
server.event-threads: 16
features.shard-block-size: 5GB
features.shard: on
transport.address-family: inet
nfs.disable: yes
Any suggestions?
-- Ian
------ Original Message ------
Sent: 3/26/2018 2:37:21 AM
Subject: Re: [Gluster-users] Sharding problem - multiple shard copies with
mismatching gfids
Ian,
Do you've a reproducer for this bug? If not a specific one, a general
outline of what operations where done on the file will help.
regards,
Raghavendra
On Mon, Mar 26, 2018 at 12:55 PM, Raghavendra Gowdappa <
Post by Raghavendra Gowdappa
Post by Krutika Dhananjay
The gfid mismatch here is between the shard and its "link-to" file, the
creation of which happens at a layer below that of shard translator on the
stack.
Adding DHT devs to take a look.
Thanks Krutika. I assume shard doesn't do any dentry operations like
rename, link, unlink on the path of file (not the gfid handle based path)
internally while managing shards. Can you confirm? If it does these
operations, what fops does it do?
@Ian,
* Since one of files listed is a DHT linkto file, I am assuming there is
only one shard of the file. If not, please list out gfids of other shards
and don't proceed with healing procedure.
* If gfids of all shards happen to be same and only linkto has a
different gfid, please proceed to step 3. Otherwise abort the healing
procedure.
* If cluster.lookup-optimize is set to true abort the healing procedure
* Delete the linkto file - the file with permissions -------T and xattr
trusted.dht.linkto and do a lookup on the file from mount point after
turning off readdriplus [1].
As to reasons on how we ended up in this situation, Can you explain me
what is the I/O pattern on this file - like are there lots of entry
operations like rename, link, unlink etc on the file? There have been known
races in rename/lookup-heal-creating-linkto where linkto and data file
have different gfids. [2] fixes some of these cases
[1] http://lists.gluster.org/pipermail/gluster-users/2017-March/
030148.html
[2] https://review.gluster.org/#/c/19547/
regards,
Raghavendra
Ian Halliday
2018-04-06 13:17:52 UTC
Permalink
Raghavendra,

Thanks! I'll get you this info within the next few days and will file a
bug report at the same time.

For what it's worth, we were able to reproduce the issue on a completely
new cluster running 3.13. The I/O pattern that most easily causes it to
fail is formatting a VM image with XFS. Formatting VMs with Ext4 will
also create the additional shard files, but the GFIDs will usually match.
I'm not sure whether there are supposed to be two identical shard
filenames, with one being empty, but they don't seem to cause VMs to
pause or fail when the GFIDs match.
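A minimal sketch of that reproduction pattern, assuming a FUSE-mounted
volume on the hypervisor (the mount point, image name, size, and guest
device below are made up):

    # Create a raw disk image for a test VM on the gluster FUSE mount.
    qemu-img create -f raw /mnt/gluster/test-vm-disk.img 20G

    # Attach the image to a test VM as a second disk, then inside the guest:
    mkfs.xfs -f /dev/vdb     # most reliably produces the gfid mismatch
    mkfs.ext4 /dev/vdb       # also creates extra shard files, but gfids usually match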

Both of these clusters are pure SSD (one replica 3 arbiter 1, the other
replica 3). I haven't seen any issues with our non-SSD clusters yet, but
they aren't pushed as hard.

Ian

------ Original Message ------
From: "Raghavendra Gowdappa" <***@redhat.com>
To: "Ian Halliday" <***@ndevix.com>
Cc: "Krutika Dhananjay" <***@redhat.com>; "gluster-user"
<gluster-***@gluster.org>; "Nithya Balachandran" <***@redhat.com>
Sent: 4/5/2018 10:39:47 PM
Subject: Re: Re[2]: [Gluster-users] Sharding problem - multiple shard
copies with mismatching gfids
Post by Raghavendra Gowdappa
Sorry for the delay, Ian :).
This looks to be a genuine issue which requires some effort in fixing.
To get at the root cause, the following information would help (see the
sketch after this list for how it could be collected):
* Client and brick logs. If you can reproduce the issue, please set
diagnostics.client-log-level and diagnostics.brick-log-level to TRACE.
If you cannot reproduce the issue, or cannot accommodate such big logs,
please set the log level to DEBUG.
* If possible, a simple reproducer. A simple script or list of steps is
appreciated.
* An strace of the VM (to find out the I/O pattern). If possible, a dump
of the traffic between the kernel and glusterfs; this can be captured by
mounting glusterfs with the --dump-fuse option.
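A sketch of how that could be gathered, using the volume name from this
thread (the qemu PID and output paths are placeholders):

    # Raise client and brick log levels (set them back to INFO afterwards).
    gluster volume set ovirt-350-zone1 diagnostics.client-log-level TRACE
    gluster volume set ovirt-350-zone1 diagnostics.brick-log-level TRACE

    # Trace the I/O pattern of the qemu process backing the VM.
    QEMU_PID=12345                      # placeholder: pid of the qemu process
    strace -f -tt -e trace=desc,file -o /var/tmp/vm-io.strace -p "$QEMU_PID"

    # Mount the volume with FUSE traffic dumping enabled.
    glusterfs --volfile-server=10.0.6.100 --volfile-id=ovirt-350-zone1 \
        --dump-fuse=/var/tmp/fuse-traffic.dump /mnt/gluster-debug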
Note that the logs you've posted here capture the scenario _after_ the
shard file has gone into a bad state, but I need information on what led
to that situation. So please start collecting this diagnostic
information as early as you can.
regards,
Raghavendra
Post by Ian Halliday
Raghavendra,
Sorry for the late follow-up. I have some more data on the issue.
The issue tends to happen when the shards are created; the easiest time
to reproduce it is during an initial VM disk format. This is a log from a
test VM that was launched and then partitioned and formatted:
[2018-04-03 02:05:00.838440] W [MSGID: 109048]
/489c6fb7-fe61-4407-8160-35c0aac40c85/images/_remove_me_9a0660e1-bd86-47ea-8e09-865c14f11f26/e2645bd1-a7f3-4cbd-9036-3d3cbc7204cd.meta
found on cached subvol ovirt-350-zone1-replicate-5
[2018-04-03 02:07:57.967489] I [MSGID: 109070]
Lookup of /.shard/927c6620-848b-4064-8c88-68a332b645c2.7 on
ovirt-350-zone1-replicate-3 (following linkfile) failed ,gfid =
00000000-0000-0000-0000-000000000000 [No such file or directory]
[2018-04-03 02:07:57.974815] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
[2018-04-03 02:07:57.979851] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980716] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
b1e3f299-32ff-497e-918b-090e957090f6, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980763] E [MSGID: 133010]
Lookup on shard 3 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.983016] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
[2018-04-03 02:07:57.988761] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
b1e3f299-32ff-497e-918b-090e957090f6, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.988844] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node =
955a5e78-ab4c-499a-89f8-511e041167fb
[2018-04-03 02:07:57.989748] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
efbb9be5-0744-4883-8f3e-e8f7ce8d7741, gfid node =
955a5e78-ab4c-499a-89f8-511e041167fb
[2018-04-03 02:07:57.989827] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret -1 and op_errno 2 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
[2018-04-03 02:07:57.989832] E [MSGID: 133010]
Lookup on shard 7 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
The message "W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb " repeated 2 times between
[2018-04-03 02:07:57.979851] and [2018-04-03 02:07:57.995739]
[2018-04-03 02:07:57.996644] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
0a701104-e9a2-44c0-8181-4a9a6edecf9f, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.996761] E [MSGID: 133010]
Lookup on shard 3 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.998986] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data
file on ovirt-350-zone1-replicate-3, gfid local =
00000000-0000-0000-0000-000000000000, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.999857] W [MSGID: 109009]
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on
subvolume ovirt-350-zone1-replicate-3, gfid local =
0a701104-e9a2-44c0-8181-4a9a6edecf9f, gfid node =
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.999899] E [MSGID: 133010]
Lookup on shard 3 failed. Base file gfid =
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.999942] W [fuse-bridge.c:896:fuse_attr_cbk]
0-glusterfs-fuse: 22338: FSTAT()
/489c6fb7-fe61-4407-8160-35c0aac40c85/images/a717e25c-f108-4367-9d28-9235bd432bb7/5a8e541e-8883-4dec-8afd-aa29f38ef502
=> -1 (Stale file handle)
[2018-04-03 02:07:57.987941] I [MSGID: 109069]
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk]
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
# find -name 927c6620-848b-4064-8c88-68a332b645c2.*
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
# file: brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f39323763363632302d383438622d343036342d386338382d3638613333326236343563322e3139
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
# file: brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f39323763363632302d383438622d343036342d386338382d3638613333326236343563322e3139
In the above example, the shard on Brick 1 is the bad one.
At this point, the VM will pause with an unknown storage error and
will not boot until the offending shards are removed.
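For cross-checking which copies are stale, an ad-hoc scan like the one
below (our own illustration, not something from this thread) lists each
shard copy's mode and gfid across the local bricks and decodes the linkto
xattr of a suspect copy to see which DHT subvolume it points at:

    BASE=927c6620-848b-4064-8c88-68a332b645c2   # base file gfid from the log above

    for f in /gluster/brick*/brick/.shard/${BASE}.*; do
        [ -e "$f" ] || continue
        gfid=$(getfattr -n trusted.gfid --only-values "$f" 2>/dev/null | xxd -p)
        mode=$(stat -c '%A' "$f")
        printf '%s  %s  %s\n' "$mode" "$gfid" "$f"
    done | sort -k2        # sort by gfid so mismatching copies stand out

    # Decode the linkto target of the brick1 copy shown in the getfattr
    # output above:
    getfattr -n trusted.glusterfs.dht.linkto --only-values \
        /gluster/brick1/brick/.shard/${BASE}.19
    # -> ovirt-350-zone1-replicate-3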
# gluster volume info
Volume Name: ovirt-350-zone1
Type: Distributed-Replicate
Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
Status: Started
Snapshot Count: 0
Number of Bricks: 7 x (2 + 1) = 21
Transport-type: tcp
Brick1: 10.0.6.100:/gluster/brick1/brick
Brick2: 10.0.6.101:/gluster/brick1/brick
Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
Brick4: 10.0.6.100:/gluster/brick2/brick
Brick5: 10.0.6.101:/gluster/brick2/brick
Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
Brick7: 10.0.6.100:/gluster/brick3/brick
Brick8: 10.0.6.101:/gluster/brick3/brick
Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
Brick10: 10.0.6.100:/gluster/brick4/brick
Brick11: 10.0.6.101:/gluster/brick4/brick
Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
Brick13: 10.0.6.100:/gluster/brick5/brick
Brick14: 10.0.6.101:/gluster/brick5/brick
Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
Brick16: 10.0.6.100:/gluster/brick6/brick
Brick17: 10.0.6.101:/gluster/brick6/brick
Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
Brick19: 10.0.6.100:/gluster/brick7/brick
Brick20: 10.0.6.101:/gluster/brick7/brick
Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.client-io-threads: off
server.allow-insecure: on
client.event-threads: 8
storage.owner-gid: 36
storage.owner-uid: 36
server.event-threads: 16
features.shard-block-size: 5GB
features.shard: on
transport.address-family: inet
nfs.disable: yes
Any suggestions?
-- Ian