Discussion:
[Gluster-users] not healing one file
Richard Neuboeck
2017-10-25 12:10:36 UTC
Hi Gluster Gurus,

I'm using a gluster volume as home for our users. The volume is
replica 3, running on CentOS 7, gluster version 3.10
(3.10.6-1.el7.x86_64). Clients are running Fedora 26 and also
gluster 3.10 (3.10.6-3.fc26.x86_64).

During the data backup I got an I/O error on one file. Manually
checking for this file on a client confirms this:

ls -l
romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/
ls: cannot access
'romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4':
Input/output error
total 2015
-rw-------. 1 romanoch tbi 998211 Sep 15 18:44 previous.js
-rw-------. 1 romanoch tbi 65222 Oct 17 17:57 previous.jsonlz4
-rw-------. 1 romanoch tbi 149161 Oct 1 13:46 recovery.bak
-?????????? ? ? ? ? ? recovery.baklz4

Out of curiosity I checked all the bricks for this file. It's present
on all of them, but a checksum shows that the file differs on one of
the three replica servers.
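For example, something like this on each of the three servers (md5sum
is only an illustration, any checksum tool will do):
# md5sum /srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4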

Querying healing information shows that the file should be healed:
# gluster volume heal home info
Brick sphere-six:/srv/gluster_home/brick
/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4

Status: Connected
Number of entries: 1

Brick sphere-five:/srv/gluster_home/brick
/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4

Status: Connected
Number of entries: 1

Brick sphere-four:/srv/gluster_home/brick
Status: Connected
Number of entries: 0

Manually triggering heal doesn't report an error but also does not
heal the file.
# gluster volume heal home
Launching heal operation to perform index self heal on volume home
has been successful

Same with a full heal
# gluster volume heal home full
Launching heal operation to perform full self heal on volume home
has been successful

According to the split-brain query, that's not the problem:
# gluster volume heal home info split-brain
Brick sphere-six:/srv/gluster_home/brick
Status: Connected
Number of entries in split-brain: 0

Brick sphere-five:/srv/gluster_home/brick
Status: Connected
Number of entries in split-brain: 0

Brick sphere-four:/srv/gluster_home/brick
Status: Connected
Number of entries in split-brain: 0


I have no idea why this situation arose in the first place, and no
idea how to solve it. I would highly appreciate any helpful feedback
I can get.

The only mention in the logs matching this file is a rename operation:
/var/log/glusterfs/bricks/srv-gluster_home-brick.log:[2017-10-23
09:19:11.561661] I [MSGID: 115061]
[server-rpc-fops.c:1022:server_rename_cbk] 0-home-server: 5266153:
RENAME
/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.jsonlz4
(48e9eea6-cda6-4e53-bb4a-72059debf4c2/recovery.jsonlz4) ->
/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
(48e9eea6-cda6-4e53-bb4a-72059debf4c2/recovery.baklz4), client:
romulus.tbi.univie.ac.at-11894-2017/10/18-07:06:07:206366-home-client-3-0-0,
error-xlator: home-posix [No data available]

I enabled directory quotas the same day this problem showed up, but
I'm not sure how quotas could have an effect like this (unless perhaps
the limit is reached, but that's not the case either).
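(Quota usage against the configured limits can be listed with
something like the following; noting it only because quota is just a
suspicion here:)
# gluster volume quota home list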

Thanks again if anyone has an idea.
Cheers
Richard
--
/dev/null
Amar Tumballi
2017-10-26 04:50:01 UTC
Thanks for this report. Many of the developers are at the Gluster
Summit in Prague this week; we will look into this and respond next
week. Hope that's fine.

Thanks,
Amar
Amar Tumballi
2017-10-26 04:51:59 UTC
On a side note, try the recently released health report tool and see
if it diagnoses any issues in your setup. Currently you may have to
run it on all three machines.
Karthik Subrahmanya
2017-10-26 05:41:53 UTC
Hey Richard,

Could you share the following information, please?
1. gluster volume info <volname>
2. getfattr output of that file from all the bricks
getfattr -d -e hex -m . <brickpath/filepath>
3. glustershd & glfsheal logs
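(Assuming default log locations, those can be collected from each
server with something like the following; the exact file names may
differ depending on the setup.)
# tail -n 500 /var/log/glusterfs/glustershd.log
# tail -n 500 /var/log/glusterfs/glfsheal-home.log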

Regards,
Karthik
Richard Neuboeck
2017-10-26 07:10:16 UTC
Hi Karthik,

thanks for taking a look at this. I haven't been working with gluster
long enough to make heads or tails of the logs. The logs are attached
to this mail and here is the other information:

# gluster volume info home

Volume Name: home
Type: Replicate
Volume ID: fe6218ae-f46b-42b3-a467-5fc6a36ad48a
Status: Started
Snapshot Count: 1
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sphere-six:/srv/gluster_home/brick
Brick2: sphere-five:/srv/gluster_home/brick
Brick3: sphere-four:/srv/gluster_home/brick
Options Reconfigured:
features.barrier: disable
cluster.quorum-type: auto
cluster.server-quorum-type: server
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-samba-metadata: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 90000
performance.cache-size: 1GB
performance.client-io-threads: on
cluster.lookup-optimize: on
cluster.readdir-optimize: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
cluster.server-quorum-ratio: 51%


[***@sphere-four ~]# getfattr -d -e hex -m .
/srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
getfattr: Removing leading '/' from absolute path names
# file:
srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x020000000000000059df20a40006f989
trusted.gfid=0xda1c94b1643544b18d5b6f4654f60bf5
trusted.glusterfs.quota.48e9eea6-cda6-4e53-bb4a-72059debf4c2.contri.1=0x0000000000009a000000000000000001
trusted.pgfid.48e9eea6-cda6-4e53-bb4a-72059debf4c2=0x00000001

[***@sphere-five ~]# getfattr -d -e hex -m .
/srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
getfattr: Removing leading '/' from absolute path names
# file:
srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.home-client-4=0x000000010000000100000000
trusted.bit-rot.version=0x020000000000000059df1f310006ce63
trusted.gfid=0xea8ecfd195fd4e48b994fd0a2da226f9
trusted.glusterfs.quota.48e9eea6-cda6-4e53-bb4a-72059debf4c2.contri.1=0x0000000000009a000000000000000001
trusted.pgfid.48e9eea6-cda6-4e53-bb4a-72059debf4c2=0x00000001

[***@sphere-six ~]# getfattr -d -e hex -m .
/srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
getfattr: Removing leading '/' from absolute path names
# file:
srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.home-client-4=0x000000010000000100000000
trusted.bit-rot.version=0x020000000000000059df11cd000548ec
trusted.gfid=0xea8ecfd195fd4e48b994fd0a2da226f9
trusted.glusterfs.quota.48e9eea6-cda6-4e53-bb4a-72059debf4c2.contri.1=0x0000000000009a000000000000000001
trusted.pgfid.48e9eea6-cda6-4e53-bb4a-72059debf4c2=0x00000001

Cheers
Richard
Karthik Subrahmanya
2017-10-26 09:34:14 UTC
Hi Richard,

Thanks for the information. As you said, there is a gfid mismatch for
the file: on brick-1 & brick-2 the gfids are the same, and on brick-3
the gfid is different. This is not considered a split-brain because we
have two good copies here.
Gluster 3.10 does not have a method to resolve this situation other
than manual intervention [1]. Basically what you need to do is remove
the file and the gfid hardlink from brick-3 (considering the brick-3
entry as bad). Then, when you do a lookup for the file from the mount,
it will recreate the entry on that brick from the good copies.
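(For concreteness, a rough sketch of that manual step. It assumes the
standard .glusterfs/<first two>/<next two>/<gfid> layout on the brick
and uses the gfid you reported for sphere-four, so please double-check
both paths before removing anything.)

On sphere-four (brick-3, the copy with the odd gfid):
# cd /srv/gluster_home/brick
# rm romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
# rm .glusterfs/da/1c/da1c94b1-6435-44b1-8d5b-6f4654f60bf5

Then stat the file once from a client mount (the mount point below is
only a placeholder) so the next lookup triggers the heal:
# stat /mnt/home/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4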

From 3.12 we have methods to resolve this situation with the cli
option [2] and with favorite-child-policy [3]. For the time being you
can use [1] to resolve this, and if you can consider upgrading to
3.12, that would give you options to handle these scenarios.

[1] http://docs.gluster.org/en/latest/Troubleshooting/split-brain/#fixing-directory-entry-split-brain
[2] https://review.gluster.org/#/c/17485/
[3] https://review.gluster.org/#/c/16878/
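
(Once on 3.12, the resolution would look roughly like one of the
following; treat the exact syntax as something to verify against the
documentation of the release you end up on.)

Resolve the mismatch for this one file by picking a copy, e.g. the one
with the latest mtime:
# gluster volume heal home split-brain latest-mtime /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4

Or let AFR resolve such mismatches automatically by policy:
# gluster volume set home cluster.favorite-child-policy mtime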

HTH,
Karthik
Richard Neuboeck
2017-10-27 07:36:14 UTC
Hi Karthik,

the procedure you described in [1] worked perfectly. After removing the
file and the hardlink on brick-3 it got healed. Client access is restored.

Since there doesn't seem to be an access problem with Fedora's 3.10
client, I'll upgrade all servers to 3.12. Just in case.

Thank you so much for your help!
All the best
Richard
Richard Neuboeck
2017-10-26 07:37:39 UTC
Hi Amar,

thanks for the information! I tried this tool on all machines.

# gluster-health-report

Loaded reports: glusterd-op-version, georep, gfid-mismatch-dht-report,
glusterd-peer-disconnect, disk_usage, errors_in_logs, coredump,
glusterd, glusterd_volume_version_cksum_errors, kernel_issues,
errors_in_logs, ifconfig, nic-health, process_status

[ OK] Disk used percentage path=/ percentage=4
[ OK] Disk used percentage path=/var percentage=4
[ OK] Disk used percentage path=/tmp percentage=4
[ OK] All peers are in connected state connected_count=2
total_peer_count=2
[ OK] no gfid mismatch
[ ERROR] Report failure report=report_check_glusterd_op_version
[ NOT OK] The maximum size of core files created is NOT set to unlimited.
[ ERROR] Report failure report=report_check_worker_restarts
[ ERROR] Report failure report=report_non_participating_bricks
[WARNING] Glusterd uptime is less than 24 hours uptime_sec=72798
[WARNING] Errors in Glusterd log file num_errors=35
[WARNING] Warnings in Glusterd log file num_warning=37
[ NOT OK] Recieve errors in "ifconfig bond0" output
[ NOT OK] Errors seen in "cat /proc/net/dev -- bond0" output
High CPU usage by Self-heal
[WARNING] Errors in Glusterd log file num_errors=77
[WARNING] Warnings in Glusterd log file num_warnings=61

Basically it's the same output on all of them, with varying error and
warning counts.
Glusterd hasn't been up for long since I updated and then rebooted the
machines yesterday. That's also the reason for some of the errors and
warnings, and for the network errors, since it always takes some time
until the bonded device (4x1Gbit, balance-alb) is fully functional.

From what I've seen in the getfattr output Karthik asked me to collect,
the GFIDs differ for the file in question, even though the report says
there is no mismatch.

So is this a split-brain situation gluster is not aware of?

Cheers
Richard