Discussion:
[Gluster-users] Files not healing & missing their extended attributes - Help!
Gambit15
2018-07-01 18:20:16 UTC
Hi Guys,
I had to restart our datacenter yesterday, but since doing so, a number of
the files on my gluster share have been stuck marked as healing. After no
signs of progress, I manually set off a full heal last night, but after
24 hours nothing's happened.

The gluster logs all look normal, and there're no messages about failed
connections or heal processes kicking off.

I checked the listed files' extended attributes on their bricks today, and
they only show the selinux attribute. There are none of the trusted.*
attributes I'd expect.
The healthy files on the bricks do have their extended attributes though.
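
For reference, the checks I ran were along these lines (the brick path and
volume name here are just placeholders):

  getfattr -m . -d -e hex /path/to/brick/<stuck-file>   # healthy files show trusted.gfid, trusted.afr.*, etc.
  gluster volume heal <volname> full                    # the full heal I set off manually
  gluster volume heal <volname> info                    # still lists the same entries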

I'm guessing that the files somehow lost their attributes, and gluster is no
longer able to work out what to do with them. It hasn't logged any errors,
warnings, or anything else out of the ordinary though, so I've no idea what
the problem is or how to resolve it.

I've got 16 hours to get this sorted before the start of work, Monday. Help!
Ashish Pandey
2018-07-01 19:34:13 UTC
You have not even mentioned the volume type and configuration, and this issue will require a lot of other information to fix. Please provide the following (the example commands after this list show roughly what I mean):

1 - What is the type of volume and its config?
2 - The gluster v <volname> info output
3 - The heal info output
4 - getfattr output of one of the files which needs healing, from all the bricks
5 - What led to the files needing healing?
6 - gluster v <volname> status
7 - glustershd.log output from just after you run a full heal or index heal
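
Roughly, for example (volume name and brick path are placeholders):

  gluster volume info <volname>
  gluster volume status <volname>
  gluster volume heal <volname> info
  getfattr -m . -d -e hex /path/to/brick/<file-needing-heal>   # on every brick
  gluster volume heal <volname> full
  tail -f /var/log/glusterfs/glustershd.log                    # capture this right after triggering the heal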

----
Ashish

Gambit15
2018-07-01 20:15:01 UTC
Hi Ashish,

The output is below. It's a rep 2+1 volume. The arbiter is offline for
maintenance at the moment, however quorum is met & no files are reported as
in split-brain (it hosts VMs, so files aren't accessed concurrently).

======================
[***@v0 glusterfs]# gluster volume info engine

Volume Name: engine
Type: Replicate
Volume ID: 279737d3-3e5a-4ee9-8d4a-97edcca42427
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: s0:/gluster/engine/brick
Brick2: s1:/gluster/engine/brick
Brick3: s2:/gluster/engine/arbiter (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
performance.low-prio-threads: 32

======================

[***@v0 glusterfs]# gluster volume heal engine info
Brick s0:/gluster/engine/brick
/__DIRECT_IO_TEST__
/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
/98495dbc-a29c-4893-b6a0-0aa70860d0c9
<LIST TRUNCATED FOR BREVITY>
Status: Connected
Number of entries: 34

Brick s1:/gluster/engine/brick
<SAME AS ABOVE - TRUNCATED FOR BREVITY>
Status: Connected
Number of entries: 34

Brick s2:/gluster/engine/arbiter
Status: Transport endpoint is not connected
Number of entries: -

======================
=== PEER V0 ===

[***@v0 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x0000000000000000000024e8
trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[***@v0 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/*
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000

# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000


=== PEER V1 ===

[***@v1 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x0000000000000000000024ec
trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

======================

cmd_history.log-20180701:

[2018-07-01 03:11:38.461175] : volume heal engine full : SUCCESS
[2018-07-01 03:11:51.151891] : volume heal data full : SUCCESS

glustershd.log-20180701:
<LOGS FROM 06/01 TRUNCATED>
[2018-07-01 07:15:04.779122] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate] 0-glusterfsd: Fetching the volume file from server...

glustershd.log:
[2018-07-01 07:15:04.779693] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing

That's the *only* message in glustershd.log today.

======================

[***@v0 glusterfs]# gluster volume status engine
Status of volume: engine
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick s0:/gluster/engine/brick              49154     0          Y       2816
Brick s1:/gluster/engine/brick              49154     0          Y       3995
Self-heal Daemon on localhost               N/A       N/A        Y       2919
Self-heal Daemon on s1                      N/A       N/A        Y       4013

Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks

======================

Okay, so actually only the directory ha_agent is listed for healing (not
its contents), & that does have attributes set.
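
If I'm reading the AFR changelog xattrs correctly (three 4-byte counters per
client: data, metadata and entry operations pending against that brick), the
non-zero part of trusted.afr.engine-client-2 above is the entry counter, i.e.
entry heals pending towards the third brick (engine-client-2, the arbiter):

  trusted.afr.engine-client-2 = 0x 00000000 00000000 000024e8
                                   data=0   meta=0   entry=0x24e8 (9448 pending)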

Many thanks for the reply!
Ashish Pandey
2018-07-02 02:37:26 UTC
The only problem at the moment is that the arbiter brick is offline. Your only concern should be completing the maintenance on the arbiter brick ASAP.
Bring this brick UP, start a FULL heal or index heal, and the volume will be in a healthy state.
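
For example, something along these lines (volume name as a placeholder):

  gluster volume start <volname> force    # restarts any brick process that is down, including the arbiter
  gluster volume heal <volname> full      # or "gluster volume heal <volname>" for an index heal
  gluster volume heal <volname> info      # the entry count should start draining once the arbiter is back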

---
Ashish


Gambit15
2018-07-03 20:27:45 UTC
Post by Ashish Pandey
The only problem at the moment is that the arbiter brick is offline. Your
only concern should be completing the maintenance on the arbiter brick ASAP.
Bring this brick UP, start a FULL heal or index heal, and the volume will be in a healthy state.
Doesn't the arbiter only resolve split-brain situations? None of the files
that have been marked for healing are marked as in split-brain.

The arbiter has now been brought back up, however the problem continues.

I've found the following information in the client log:

[2018-07-03 19:09:29.245089] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.metadata 5e95ba8c-2f12-49bf-be2d-b4baf210d366 on engine-client-1 and b9cd7613-3b96-415d-a549-1dc788a4f94d on engine-client-0
[2018-07-03 19:09:29.245585] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430040: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata => -1 (Input/output error)
[2018-07-03 19:09:30.619000] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.lockspace 8e86902a-c31c-4990-b0c5-0318807edb8f on engine-client-1 and e5899a4c-dc5d-487e-84b0-9bbc73133c25 on engine-client-0
[2018-07-03 19:09:30.619360] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430656: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace => -1 (Input/output error)

As you can see from the logs I posted previously, neither of those two
files, on either of the two servers, has any of gluster's extended
attributes set.

The arbiter doesn't have any record of the files in question, as they were
created after it went offline.

How do I fix this? Is it possible to locate the correct gfids somewhere &
redefine them on the files manually?
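
From what I understand, a file's GFID normally lives both in its trusted.gfid
xattr and as a hard link named after the GFID under .glusterfs/<xx>/<yy>/ on
each brick, so I'd have expected to be able to cross-check it with something
like this (using one of the GFIDs from the log above):

  getfattr -n trusted.gfid -e hex /gluster/engine/brick/<path-to-file>
  ls -li /gluster/engine/brick/.glusterfs/5e/95/5e95ba8c-2f12-49bf-be2d-b4baf210d366

But on these files trusted.gfid simply isn't set.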

Cheers,
Doug

Vlad Kopylov
2018-07-04 03:37:38 UTC
It might be too late, but a simple, always-working solution for such cases
is rebuilding .glusterfs.

Kill it and query the attributes for all files again; that will recreate .glusterfs on
all bricks.

Something like what's mentioned here:
https://lists.gluster.org/pipermail/gluster-users/2018-January/033352.html
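
Very roughly, the idea is something like this (only a sketch -- test it on one
brick first and keep a backup; paths and volume name are placeholders):

  # with the volume (or at least the brick) stopped, on each brick:
  mv /path/to/brick/.glusterfs /path/to/brick/.glusterfs.bak
  # start the volume again, then from a FUSE mount stat every file so the
  # lookups recreate the gfid handles under .glusterfs:
  find /mnt/<volname> -exec stat {} \; > /dev/null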
Gambit15
2018-07-04 23:50:50 UTC
Post by Vlad Kopylov
It might be too late, but a simple, always-working solution for such
cases is rebuilding .glusterfs.
Kill it and query the attributes for all files again; that will recreate .glusterfs on
all bricks.
Something like what's mentioned here:
https://lists.gluster.org/pipermail/gluster-users/2018-January/033352.html
Is my problem with .glusterfs though? I'd be super cautious removing the
entire directory unless I'm sure that's the solution...

Cheers,
Karthik Subrahmanya
2018-07-04 09:26:45 UTC
Hi,

From the logs you have pasted it looks like those files are in GFID
split-brain.
They should have the GFIDs assigned on both the data bricks but they will
be different.

Can you please paste the getfattr output of those files and their parent
from all the bricks again?
Which version of gluster are you using?

If you are using a version higher than or equal to 3.12, gfid split-brains
can be resolved using the methods (except method 4)
explained in the "Resolution of split-brain using gluster CLI" section in [1].
Also note that for gfid split-brain resolution using the CLI you have to pass
the name of the file as the argument and not the GFID.

If it is lower than 3.12 (please consider upgrading, since those versions are
EOL) you have to resolve it manually as explained in [2].

[1] https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/
[2] https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/#dir-split-brain
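
For example, the CLI approach looks roughly like this (volume name, brick and
file path are placeholders; note the file is given by its path relative to the
volume root, not by its GFID):

  gluster volume heal <VOLNAME> split-brain latest-mtime <FILE>
  gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME> <FILE>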

Thanks & Regards,
Karthik
Gambit15
2018-07-05 00:39:50 UTC
Hi Karthik,
Many thanks for the response!
Post by Karthik Subrahmanya
Hi,
From the logs you have pasted it looks like those files are in GFID
split-brain.
They should have the GFIDs assigned on both the data bricks but they will
be different.
Can you please paste the getfattr output of those files and their parent
from all the bricks again?
The files don't have any attributes set, however I did manage to find their
corresponding entries in .glusterfs

==================================
[***@v0 .glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x0000000000000000000024ea
trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[***@v0 .glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/*
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000

# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000

[***@v0 .glusterfs]# ls -l /gluster/engine/brick/.glusterfs/db/9a/db9afb92-d2bc-49ed-8e34-dcd437ba7be2/
total 0
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.lockspace -> /var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/2502aff4-6c67-4643-b681-99f2c87e793d/03919182-6be2-4cbc-aea2-b9d68422a800
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.metadata -> /var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/99510501-6bdc-485a-98e8-c2f82ff8d519/71fa7e6c-cdfb-4da8-9164-2404b518d0ee

==================================

Again, here are the relevant client log entries:

[2018-07-03 19:09:29.245089] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.metadata 5e95ba8c-2f12-49bf-be2d-b4baf210d366 on engine-client-1 and b9cd7613-3b96-415d-a549-1dc788a4f94d on engine-client-0
[2018-07-03 19:09:29.245585] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430040: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata => -1 (Input/output error)
[2018-07-03 19:09:30.619000] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.lockspace 8e86902a-c31c-4990-b0c5-0318807edb8f on engine-client-1 and e5899a4c-dc5d-487e-84b0-9bbc73133c25 on engine-client-0
[2018-07-03 19:09:30.619360] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430656: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace => -1 (Input/output error)

[***@v0 .glusterfs]# find . -type f | grep -E "5e95ba8c-2f12-49bf-be2d-b4baf210d366|8e86902a-c31c-4990-b0c5-0318807edb8f|b9cd7613-3b96-415d-a549-1dc788a4f94d|e5899a4c-dc5d-487e-84b0-9bbc73133c25"
[***@v0 .glusterfs]#

==================================
Post by Karthik Subrahmanya
Which version of gluster are you using?
3.8.5
An upgrade is on the books, however I had to go back on my last attempt as
3.12 didn't work with 3.8 & I was unable to do a live rolling upgrade. Once
I've got this GFID mess sorted out, I'll give a full upgrade a go, as I've
already had to fail over this cluster's services to another cluster.

Post by Karthik Subrahmanya
If you are using a version higher than or equal to 3.12, gfid split-brains
can be resolved using the methods (except method 4)
explained in the "Resolution of split-brain using gluster CLI" section in [1].
Also note that for gfid split-brain resolution using the CLI you have to pass
the name of the file as the argument and not the GFID.
If it is lower than 3.12 (please consider upgrading, since those versions are
EOL) you have to resolve it manually as explained in [2].
[1] https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/
[2] https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/#dir-split-brain
"The user needs to remove either file '1' on brick-a or the file '1' on
brick-b to resolve the split-brain. In addition, the corresponding
gfid-link file also needs to be removed."

Okay, so as you can see above, the files don't have a trusted.gfid
attribute, and on the brick I didn't find any files in .glusterfs with the
same names as the GFIDs reported in the client log. I did, however, find the
symlinked files in a .glusterfs directory named after the parent directory's GFID.

[***@v0 .glusterfs]# ls -l /gluster/engine/brick/.glusterfs/db/9a/db9afb92-d2bc-49ed-8e34-dcd437ba7be2/
total 0
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.lockspace -> /var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/2502aff4-6c67-4643-b681-99f2c87e793d/03919182-6be2-4cbc-aea2-b9d68422a800
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.metadata -> /var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/99510501-6bdc-485a-98e8-c2f82ff8d519/71fa7e6c-cdfb-4da8-9164-2404b518d0ee


So if I delete those two symlinks & the files they point to, on one of the
two bricks, will that resolve the split brain? Is that correct?
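
In other words, if I've understood [2] correctly, something along these lines
on the brick whose copy gets discarded (just a sketch, before I touch anything):

  # on the brick holding the bad copy only:
  rm /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
  rm /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
  # plus their gfid-link entries under .glusterfs, if/where they exist
  # then trigger a heal and re-check from the mount:
  gluster volume heal engine
  gluster volume heal engine info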
Post by Karthik Subrahmanya
Thanks & Regards,
Karthik
Post by Gambit15
Post by Ashish Pandey
The only problem at the moment is that arbiter brick offline. You should
only bother about completion of maintenance of arbiter brick ASAP.
Bring this brick UP, start FULL heal or index heal and the volume will
be in healthy state.
Doesn't the arbiter only resolve split-brain situations? None of the
files that have been marked for healing are marked as in split-brain.
The arbiter has now been brought back up, however the problem continues.
[2018-07-03 19:09:29.245089] W [MSGID: 108008]
[afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check]
0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-
dcd437ba7be2>/hosted-engine.metadata 5e95ba8c-2f12-49bf-be2d-b4baf210d366
on engine-client-1 and b9cd7613-3b96-415d-a549-1dc788a4f94d on
engine-client-0
[2018-07-03 19:09:29.245585] W [fuse-bridge.c:471:fuse_entry_cbk]
0-glusterfs-fuse: 10430040: LOOKUP() /98495dbc-a29c-4893-b6a0-
0aa70860d0c9/ha_agent/hosted-engine.metadata => -1 (Input/output error)
[2018-07-03 19:09:30.619000] W [MSGID: 108008]
[afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check]
0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-
dcd437ba7be2>/hosted-engine.lockspace 8e86902a-c31c-4990-b0c5-0318807edb8f
on engine-client-1 and e5899a4c-dc5d-487e-84b0-9bbc73133c25 on
engine-client-0
[2018-07-03 19:09:30.619360] W [fuse-bridge.c:471:fuse_entry_cbk]
0-glusterfs-fuse: 10430656: LOOKUP() /98495dbc-a29c-4893-b6a0-
0aa70860d0c9/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
As you can see from the logs I posted previously, neither of those two
files, on either of the two servers, has any of gluster's extended
attributes set.
The arbiter doesn't have any record of the files in question, as they
were created after it went offline.
How do I fix this? Is it possible to locate the correct gfids somewhere &
redefine them on the files manually?
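(For the record: the client log above already names the two candidate GFIDs per file, one per data brick. Re-assigning one by hand is possible but risky - a sketch of the pattern sometimes used, assuming the brick process is stopped, you've decided which copy to keep, and you accept that the matching .glusterfs gfid-link would also need to exist; the GFID below is just the engine-client-0 one from the log:)

setfattr -n trusted.gfid -v 0xe5899a4cdc5d487e84b09bbc73133c25 \
    /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace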
Cheers,
Doug
------------------------------
Post by Ashish Pandey
*Sent: *Monday, July 2, 2018 1:45:01 AM
*Subject: *Re: [Gluster-users] Files not healing & missing their
extended attributes - Help!
Hi Ashish,
The output is below. It's a rep 2+1 volume. The arbiter is offline for
maintenance at the moment, however quorum is met & no files are reported as
in split-brain (it hosts VMs, so files aren't accessed concurrently).
======================
Volume Name: engine
Type: Replicate
Volume ID: 279737d3-3e5a-4ee9-8d4a-97edcca42427
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Brick1: s0:/gluster/engine/brick
Brick2: s1:/gluster/engine/brick
Brick3: s2:/gluster/engine/arbiter (arbiter)
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
performance.low-prio-threads: 32
======================
Brick s0:/gluster/engine/brick
/__DIRECT_IO_TEST__
/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
/98495dbc-a29c-4893-b6a0-0aa70860d0c9
<LIST TRUNCATED FOR BREVITY>
Status: Connected
Number of entries: 34
Brick s1:/gluster/engine/brick
<SAME AS ABOVE - TRUNCATED FOR BREVITY>
Status: Connected
Number of entries: 34
Brick s2:/gluster/engine/arbiter
Status: Transport endpoint is not connected
Number of entries: -
======================
=== PEER V0 ===
98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x0000000000000000000024e8
trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/*
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000

# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000

=== PEER V1 ===
98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x0000000000000000000024ec
trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
======================
[2018-07-01 03:11:38.461175] : volume heal engine full : SUCCESS
[2018-07-01 03:11:51.151891] : volume heal data full : SUCCESS
<LOGS FROM 06/01 TRUNCATED>
[2018-07-01 07:15:04.779122] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate]
0-glusterfsd: Fetching the volume file from server...
[2018-07-01 07:15:04.779693] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
That's the *only* message in glustershd.log today.
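(If the shd stays silent, the heal counters are a quicker check than the log; these are standard commands, shown with this volume's name:)

gluster volume heal engine info
gluster volume heal engine statistics heal-count
gluster volume heal engine info split-brain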
======================
Status of volume: engine
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick s0:/gluster/engine/brick              49154     0          Y       2816
Brick s1:/gluster/engine/brick              49154     0          Y       3995
Self-heal Daemon on localhost               N/A       N/A        Y       2919
Self-heal Daemon on s1                      N/A       N/A        Y       4013

Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks
======================
Okay, so actually only the directory ha_agent is listed for healing (not
its contents), & that does have attributes set.
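(For what it's worth, trusted.afr.<client> decodes as three 32-bit big-endian counters - data, metadata, entry - so 0x0000000000000000000024e8 on engine-client-2 means zero pending data/metadata operations and 0x24e8 pending entry operations against the arbiter brick, which fits it having been offline while those files were created.)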
Many thanks for the reply!
Post by Ashish Pandey
You have not even talked about the volume type and configuration, and
this issue will require a lot of other information to fix.
1 - What is the type of volume and its config?
2 - Provide the gluster v <volname> info output
3 - Heal info output
4 - getfattr of one of the files which needs healing, from all the bricks
5 - What led to the healing of the file?
6 - gluster v <volname> status
7 - glustershd.log output just after you run a full heal or index heal
----
Ashish
------------------------------
*Sent: *Sunday, July 1, 2018 11:50:16 PM
*Subject: *[Gluster-users] Files not healing & missing their
extended attributes - Help!
Hi Guys,
I had to restart our datacenter yesterday, but since doing so a number
of the files on my gluster share have been stuck, marked as healing. After
no signs of progress, I manually set off a full heal last night, but after
24hrs, nothing's happened.
The gluster logs all look normal, and there're no messages about failed
connections or heal processes kicking off.
I checked the listed files' extended attributes on their bricks today,
and they only show the selinux attribute. There's none of the trusted.*
attributes I'd expect.
The healthy files on the bricks do have their extended attributes though.
I'm guessing that perhaps the files somehow lost their attributes, and
gluster is no longer able to work out what to do with them? It's not logged
any errors, warnings, or anything else out of the normal though, so I've no
idea what the problem is or how to resolve it.
I've got 16 hours to get this sorted before the start of work, Monday. Help!
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Vlad Kopylov
2018-07-05 02:00:05 UTC
Permalink
you'll need to query the attrs of those files for them to be updated in .glusterfs

regarding wiping .glusterfs - I've done it half a dozen times on live data;
it is a simple drill which fixes almost everything.
Often you don't have time to ask around etc., you just need it working ASAP,
so you delete gluster, wipe all configs, versions, volumes and .glusterfs
on all bricks,
reinstall everything, create the volumes anew pointing to the bricks with the
existing data,
then run an attr query on each file for it to populate .glusterfs (
https://lists.gluster.org/pipermail/gluster-users/2018-January/033352.html)
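(A sketch of that drill, assuming the brick path /gluster/engine/brick and a FUSE mount at /mnt/engine - both just placeholders here - and with the usual caveats about doing this on live data:)

# on every brick, after removing the old gluster install/configs:
rm -rf /gluster/engine/brick/.glusterfs
setfattr -x trusted.glusterfs.volume-id /gluster/engine/brick
setfattr -x trusted.gfid /gluster/engine/brick
# reinstall gluster, recreate the volume pointing at the existing brick paths, then from a client:
find /mnt/engine -exec stat {} \; > /dev/null
# or force named lookups that also report brick placement:
find /mnt/engine -exec getfattr -h -n trusted.glusterfs.pathinfo {} + > /dev/null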

In real-life situations this works much faster than anything else,
unless you suspect network issues or something else non-gluster-related.

-v
Post by Gambit15
Hi Karthik,
Many thanks for the response!
Post by Karthik Subrahmanya
Hi,
From the logs you have pasted it looks like those files are in GFID
split-brain.
They should have the GFIDs assigned on both the data bricks but they will
be different.
Can you please paste the getfattr output of those files and their parent
from all the bricks again?
The files don't have any attributes set, however I did manage to find
their corresponding entries in .glusterfs
==================================
98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x0000000000000000000024ea
trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/*
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000

# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000
glusterfs/db/9a/db9afb92-d2bc-49ed-8e34-dcd437ba7be2/
total 0
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.lockspace ->
/var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/2502aff4-6c67-
4643-b681-99f2c87e793d/03919182-6be2-4cbc-aea2-b9d68422a800
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.metadata ->
/var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/99510501-6bdc-
485a-98e8-c2f82ff8d519/71fa7e6c-cdfb-4da8-9164-2404b518d0ee
==================================
_______________________________________________
Gluster-users mailing list
https://lists.gluster.org/mailman/listinfo/gluster-users