Discussion: [Gluster-devel] Query on healing process

ABHISHEK PALIWAL
2016-03-03 05:44:55 UTC
Hi Ravi,

As discussed earlier, I investigated this issue and found that healing is
not triggered: the "gluster volume heal c_glusterfs info split-brain"
command shows no entries, even though the file is in a split-brain state.

So what I did was manually delete the gfid entry of that file from the
.glusterfs directory and follow the instructions in the link below to
trigger the heal:

https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md

and this works fine for me.
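
Roughly, the steps were as below (a sketch; the brick path and the file's
gfid are taken from the getfattr outputs later in this thread, and the
deletion is done on the bad brick only, after which a heal copies the
good copy back):

# rm /opt/lvmdir/c2/brick/.glusterfs/9f/5e/9f5e354e-cfda-4014-9ddc-e7d5ffe760ae
# rm /opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
# gluster volume heal c_glusterfs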

But my question is: why does the split-brain command not show any file in
its output?

Here I am attaching all the logs I collected from the node, along with
the output of commands from both of the boards.

Two directories are present in the tar file:

000300 - logs for the board which is running continuously
002500 - logs for the board which was rebooted

I am waiting for your reply; please help me out with this issue.

Thanks in advance.

Regards,
Abhishek
Post by ABHISHEK PALIWAL
Hi,
Here I have one query regarding the time taken by the healing process.
In the current two-node setup, when we rebooted one node, the
self-healing process started within a 5-minute interval on the board,
resulting in corruption of some files' data.
And to resolve it I searched on google and found
https://support.rackspace.com/how-to/glusterfs-troubleshooting/
mentioning that the healing process can take up to 10 minutes to start:
"Healing replicated volumes
When any brick in a replicated volume goes offline, the glusterd
daemons on the remaining nodes keep track of all the files that are not
replicated to the offline brick. When the offline brick becomes available
again, the cluster initiates a healing process, replicating the updated
files to that brick. *The start of this process can take up to 10
minutes, based on observation.*"
After allowing more than 5 minutes, the file corruption problem was
resolved.
So, here my question is: is there any way to reduce the time the healing
process takes to start?
Post by Ravishankar N
Heal should start immediately after the brick process comes up. What
version of gluster are you using? What do you mean by corruption of data?
Also, how did you observe that the heal started after 5 minutes?
-Ravi
Post by ABHISHEK PALIWAL
Hi Ravi,
Thanks for the response.
We are using Glusterfs-3.7.8.
We have a logging file which saves logs of the events for every board of
a node, and these files are kept in sync using glusterfs. The system is
in replica 2 mode, meaning that when one brick in a replicated volume
goes offline, the glusterd daemons on the other nodes keep track of all
the files that are not replicated to the offline brick. When the offline
brick becomes available again, the cluster initiates a healing process,
replicating the updated files to that brick. But in our case, we see
that the log file of one board is not in sync and its format is
corrupted, i.e. the files are not in sync.
Even the outcome of "gluster volume heal c_glusterfs info" shows that
there are no pending heals.
Also, the logging file which is updated is of fixed size, and new
entries wrap around, overwriting the old entries.
This way we have seen that after a few restarts the contents of the same
file on the two bricks are different, but the volume heal info shows
zero entries.
But when we tried to put a delay > 5 min before the healing, everything
worked fine.
Post by Ravishankar N
Just to understand you correctly: you have mounted the 2-node replica-2
volume on both these nodes and are writing to a logging file from the
mounts, right?
Post by ABHISHEK PALIWAL
Yes, correct.
Post by Ravishankar N
Okay, so when you say the files are not in sync until some time, are you
getting stale data when accessing from the mount?
I'm not able to figure out why heal info shows zero when the files are
not in sync, despite all IO happening from the mounts. Could you provide
the output of getfattr -d -m . -e hex /brick/file-name from both bricks
when you hit this issue?
Post by ABHISHEK PALIWAL
I'll provide the logs once I get them. Here, the delay means we are
powering on the second board after 10 minutes.
--
Regards
Abhishek Paliwal
Ravishankar N
2016-03-03 10:40:21 UTC
Hi,
Post by ABHISHEK PALIWAL
Hi Ravi,
As discussed earlier, I investigated this issue and found that healing
is not triggered: the "gluster volume heal c_glusterfs info split-brain"
command shows no entries, even though the file is in a split-brain state.
Couple of observations from the 'commands_output' file.

The afr xattrs do not indicate that the file is in split brain:

getfattr -d -m . -e hex \
opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
# file: opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-1=0x000000000000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000000b56d6dd1d000ec7a9
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae

getfattr -d -m . -e hex
opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-0=0x000000080000000000000000
trusted.afr.c_glusterfs-client-2=0x000000020000000000000000
trusted.afr.c_glusterfs-client-4=0x000000020000000000000000
trusted.afr.c_glusterfs-client-6=0x000000020000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000000b56d6dcb7000c87e7
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae

1. There doesn't seem to be a split-brain going by the trusted.afr*
xattrs (see the note on reading these xattrs after this list).
2. You seem to have re-used the bricks from another volume/setup. For
replica 2, only trusted.afr.c_glusterfs-client-0 and
trusted.afr.c_glusterfs-client-1 must be present, but I see 4 xattrs:
client-0, 2, 4 and 6.
3. On the rebooted node, do you have ssl enabled by any chance? There is
a bug for "Not able to fetch volfile" when ssl is enabled:
https://bugzilla.redhat.com/show_bug.cgi?id=1258931
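
A note on reading these xattrs: each trusted.afr.<volume>-client-N value
is three hex-encoded 4-byte counters - pending data, metadata and entry
operations recorded against the brick that client-N maps to. For example,
in the second output above:

trusted.afr.c_glusterfs-client-0=0x 00000008 00000000 00000000
                                    data     metadata entry

i.e. 8 pending data operations are recorded against client-0. A
split-brain would require both bricks to record non-zero counters against
each other; here only one side does, so this is a pending heal rather
than a split-brain.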

Btw, for data and metadata split-brains you can use the gluster CLI
https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md
instead of modifying the files from the back end.
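
For example, once a file shows up in heal info split-brain, it can be
resolved with one of the following (a sketch using the file and bricks
from this thread; the <FILE> argument is the path relative to the volume
root):

# gluster volume heal c_glusterfs split-brain bigger-file /logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
# gluster volume heal c_glusterfs split-brain source-brick 10.32.0.48:/opt/lvmdir/c2/brick /logfiles/availability/CELLO_AVAILABILITY2_LOG.xml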

-Ravi
ABHISHEK PALIWAL
2016-03-03 11:24:03 UTC
Post by Ravishankar N
1. There doesn't seem to be a split-brain going by the trusted.afr*
xattrs.
If it is not a split-brain problem, then how can I resolve this?
Post by Ravishankar N
2. You seem to have re-used the bricks from another volume/setup. For
replica 2, only trusted.afr.c_glusterfs-client-0 and
trusted.afr.c_glusterfs-client-1 must be present but I see 4 xattrs -
client-0,2,4 and 6
Could you please suggest why these entries are there? I am not able to
work out the scenario. I am rebooting one board multiple times to
reproduce the issue, and after every reboot I do a remove-brick and
add-brick on the same volume for the second board.
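
For context, the per-reboot sequence looks roughly like this (a sketch
based on our setup, using the brick path from the volume info later in
this thread):

# gluster volume remove-brick c_glusterfs replica 1 10.32.1.144:/opt/lvmdir/c2/brick force
# gluster volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force

If each add-brick introduces a new client id into the volume graph, that
would be consistent with the client-0,2,4,6 xattrs you noted.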
Post by Ravishankar N
3. On the rebooted node, do you have ssl enabled by any chance? There is
a bug for "Not able to fetch volfile" when ssl is enabled:
https://bugzilla.redhat.com/show_bug.cgi?id=1258931
Btw, for data and metadata split-brains you can use the gluster CLI
https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md
instead of modifying the files from the back end.
But you are saying it is not a split-brain problem, and the split-brain
command is not showing any file, so how can I find the bigger file?
Also, in my case the file size is fixed at 2MB; it is overwritten every
time.
--
Regards
Abhishek Paliwal
ABHISHEK PALIWAL
2016-03-04 06:40:29 UTC
Hi Ravi,

Post by Ravishankar N
3. On the rebooted node, do you have ssl enabled by any chance? There is
a bug for "Not able to fetch volfile" when ssl is enabled:
https://bugzilla.redhat.com/show_bug.cgi?id=1258931

->>>>> I have checked; ssl is disabled, but I am still getting these
errors:

# gluster volume heal c_glusterfs info
c_glusterfs: Not able to fetch volfile from glusterd
Volume heal failed.

# gluster volume heal c_glusterfs info split-brain
c_glusterfs: Not able to fetch volfile from glusterd
Volume heal failed.

And based on your observation I understand that this is not a split-brain
problem, but *is there any way to find the file which is not in
split-brain but also not in sync?*

# getfattr -m . -d -e hex
/opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
getfattr: Removing leading '/' from absolute path names
# file:
opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-0=0x000000000000000000000000
trusted.afr.c_glusterfs-client-2=0x000000000000000000000000
trusted.afr.c_glusterfs-client-4=0x000000000000000000000000
trusted.afr.c_glusterfs-client-6=0x000000000000000000000000
trusted.afr.c_glusterfs-client-8=0x000000060000000000000000  // client-8
is the latest client in our case, and the leading 8 digits (00000006...)
indicate that there is something pending in the changelog data
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000001356d86c0c000217fd
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae

# lhsh 002500 getfattr -m . -d -e hex
/opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
getfattr: Removing leading '/' from absolute path names
# file:
opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-1=0x000000000000000000000000  // and here
we can say that there is no split-brain, but the file is out of sync
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000001156d86c290005735c
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae
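
Reading the two outputs together: the running board records 6 pending
data operations for the file against client-8, while the rebooted board
records nothing against client-1. The accusation is one-sided, so this
looks like a pending heal rather than a split-brain, which would explain
why the split-brain command lists nothing.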

# gluster volume info

Volume Name: c_glusterfs
Type: Replicate
Volume ID: c6a61455-d378-48bf-ad40-7a3ce897fc9c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.32.0.48:/opt/lvmdir/c2/brick
Brick2: 10.32.1.144:/opt/lvmdir/c2/brick
Options Reconfigured:
performance.readdir-ahead: on
network.ping-timeout: 4
nfs.disable: on


# gluster volume info

Volume Name: c_glusterfs
Type: Replicate
Volume ID: c6a61455-d378-48bf-ad40-7a3ce897fc9c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.32.0.48:/opt/lvmdir/c2/brick
Brick2: 10.32.1.144:/opt/lvmdir/c2/brick
Options Reconfigured:
performance.readdir-ahead: on
network.ping-timeout: 4
nfs.disable: on

# gluster --version
glusterfs 3.7.8 built on Feb 17 2016 07:49:49
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General
Public License.
# gluster volume heal info heal-failed
Usage: volume heal <VOLNAME> [enable | disable | full |statistics
[heal-count [replica <HOSTNAME:BRICKNAME>]] |info [healed | heal-failed |
split-brain] |split-brain {bigger-file <FILE> |source-brick
<HOSTNAME:BRICKNAME> [<FILE>]}]
# gluster volume heal c_glusterfs info heal-failed
Command not supported. Please use "gluster volume heal c_glusterfs info"
and logs to find the heal information.
# lhsh 002500

002500> gluster --version
glusterfs 3.7.8 built on Feb 17 2016 07:49:49
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General
Public License.
002500>

Regards,
Abhishek
--
Regards
Abhishek Paliwal
Ravishankar N
2016-03-04 12:01:18 UTC
Post by ABHISHEK PALIWAL
Hi Ravi,
I have checked; ssl is disabled, but I am still getting these errors:
# gluster volume heal c_glusterfs info
c_glusterfs: Not able to fetch volfile from glusterd
Volume heal failed.
Ok, just to confirm: glusterd and the other brick processes are running
after this node rebooted?
When you run the above command, you need to check
/var/log/glusterfs/glfsheal-volname.log for errors. Setting
client-log-level to DEBUG would give you a more verbose message.
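
For instance (a sketch, assuming the volume name from this thread; the
log file name follows the glfsheal-volname.log pattern above):

# gluster volume set c_glusterfs diagnostics.client-log-level DEBUG
# gluster volume heal c_glusterfs info
# tail /var/log/glusterfs/glfsheal-c_glusterfs.log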
Post by ABHISHEK PALIWAL
# gluster volume heal c_glusterfs info split-brain
c_glusterfs: Not able to fetch volfile from glusterd
Volume heal failed.
And based on your observation I understand that this is not a split-brain
problem, but *is there any way to find the file which is not in
split-brain but also not in sync?*
`gluster volume heal c_glusterfs info split-brain` should give you files
that need heal.
ABHISHEK PALIWAL
2016-03-04 12:53:56 UTC
Post by Ravishankar N
Ok, just to confirm: glusterd and the other brick processes are running
after this node rebooted?
When you run the above command, you need to check
/var/log/glusterfs/glfsheal-volname.log for errors. Setting
client-log-level to DEBUG would give you a more verbose message.
Yes, glusterd and the other brick processes are running fine. I have
checked the /var/log/glusterfs/glfsheal-volname.log file (without
log-level=DEBUG). Here are the logs from that file:

[2016-03-02 13:51:39.059440] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2016-03-02 13:51:39.072172] W [MSGID: 101012]
[common-utils.c:2776:gf_get_reserved_ports] 0-glusterfs: could not open the
file /proc/sys/net/ipv4/ip_local_reserved_ports for getting reserved ports
info [No such file or directory]
[2016-03-02 13:51:39.072228] W [MSGID: 101081]
[common-utils.c:2810:gf_process_reserved_ports] 0-glusterfs: Not able to
get reserved ports, hence there is a possibility that glusterfs may consume
reserved port
[2016-03-02 13:51:39.072583] E [socket.c:2278:socket_connect_finish]
0-gfapi: connection to 127.0.0.1:24007 failed (Connection refused)
[2016-03-02 13:51:39.072663] E [MSGID: 104024]
[glfs-mgmt.c:738:mgmt_rpc_notify] 0-glfs-mgmt: failed to connect with
remote-host: localhost (Transport endpoint is not connected) [Transport
endpoint is not connected]
[2016-03-02 13:51:39.072700] I [MSGID: 104025]
[glfs-mgmt.c:744:mgmt_rpc_notify] 0-glfs-mgmt: Exhausted all volfile
servers [Transport endpoint is not connected]
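
The key line is the "Connection refused" to 127.0.0.1:24007, glusterd's
management port: glfsheal cannot reach glusterd at all. A quick sanity
check on the rebooted board might be, for instance:

# netstat -tlnp | grep 24007
# gluster volume status c_glusterfs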
Post by Ravishankar N
`gluster volume heal c_glusterfs info split-brain` should give you files
that need heal.
I have run the "gluster volume heal c_glusterfs info split-brain"
command, but it does not show the file which is out of sync. That is the
issue: the file is not in sync on the two bricks, yet the command's
output does not list it as needing heal.

That is why I am asking whether there is any command, other than this
split-brain command, that can find the files which require a heal
operation but are not displayed in the output of "gluster volume heal
c_glusterfs info split-brain".
--
Regards
Abhishek Paliwal
Ravishankar N
2016-03-04 13:06:08 UTC
Post by ABHISHEK PALIWAL
[2016-03-02 13:51:39.072583] E [socket.c:2278:socket_connect_finish]
0-gfapi: connection to 127.0.0.1:24007 failed (Connection refused)
Not sure why ^^ occurs. You could try flushing iptables (iptables -F),
restarting glusterd, and running the heal info command again.
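
Something like the following, as a sketch (the glusterd restart command
depends on the platform's init system):

# iptables -F
# /etc/init.d/glusterd restart
# gluster volume heal c_glusterfs info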
Post by Ravishankar N
`gluster volume heal c_glusterfs info split-brain` should give you files
that need heal.
Sorry, I meant 'gluster volume heal c_glusterfs info' should give you
the files that need heal, and 'gluster volume heal c_glusterfs info
split-brain' the list of files in split-brain.
The commands are detailed in
https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md
ABHISHEK PALIWAL
2016-03-04 13:30:03 UTC
Permalink
Post by Ravishankar N
Not sure why ^^ occurs. You could try flushing iptables (iptables -F),
restarting glusterd, and running the heal info command again.
No hint from the logs? I'll try your suggestion.
Post by Ravishankar N
Sorry, I meant 'gluster volume heal c_glusterfs info' should give you
the files that need heal, and 'gluster volume heal c_glusterfs info
split-brain' the list of files in split-brain.
Yes, I have tried this as well. It also gives "Number of entries: 0",
meaning no healing is required, but the file
/opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml is
not in sync; both bricks show a different version of this file.
You can see it in the getfattr command outcome as well.

# getfattr -m . -d -e hex
/opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
getfattr: Removing leading '/' from absolute path names
# file:
opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-0=0x000000000000000000000000
trusted.afr.c_glusterfs-client-2=0x000000000000000000000000
trusted.afr.c_glusterfs-client-4=0x000000000000000000000000
trusted.afr.c_glusterfs-client-6=0x000000000000000000000000
trusted.afr.c_glusterfs-client-8=0x000000060000000000000000  // client-8 is
the latest client in our case, and the leading 8 digits (00000006) show
that something is pending in the changelog data
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000001356d86c0c000217fd
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae

# lhsh 002500 getfattr -m . -d -e hex
/opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
getfattr: Removing leading '/' from absolute path names
# file:
opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-1=0x000000000000000000000000  // and here
we can see that there is no split-brain, but the file is out of sync
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000001156d86c290005735c
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae
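To make those annotations concrete: each trusted.afr.<volume>-client-<N>
value packs three big-endian 32-bit pending counters, in the order data,
metadata, entry. The non-zero value above therefore decodes as:

0x 00000006 00000000 00000000
     data   metadata   entry
# 6 pending data operations recorded against the other brick.
# Split-brain requires both copies to blame each other; here only one
# side holds a non-zero counter, so this is "needs heal" rather than
# split-brain.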
Regards,
Abhishek
ABHISHEK PALIWAL
2016-03-14 05:06:15 UTC
Permalink
Hi Ravishankar,

I just want to inform you that this file has some properties different
from other files: it has a fixed size, and when there is no space left
in the file, the next data starts wrapping from the top of the file.
So in this file we are wrapping the data as well (a rough sketch of the
write pattern follows).
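Something like this (a minimal sketch with dd; the 1 MiB size, the helper
name, and the use of dd are illustrative assumptions, not our actual
logger):

SIZE=$((1024 * 1024))           # fixed file size (assumed value)
LOG=CELLO_AVAILABILITY2_LOG.xml
write_entry() {                 # $1 = payload file, $2 = logical byte offset
    dd if="$1" of="$LOG" bs=1 seek=$(( $2 % SIZE )) conv=notrunc 2>/dev/null
}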

So, I just want to know: will this property of the file affect gluster's
ability to identify split-brain or the xattr attributes?

Regards,
Abhishek
Ravishankar N
2016-03-14 08:07:43 UTC
Permalink
Post by ABHISHEK PALIWAL
So, I just want to know: will this property of the file affect gluster's
ability to identify split-brain or the xattr attributes?
Hi,
No, it shouldn't matter at what offset the writes happen. The xattrs only
track that a write was missed on a brick (and therefore that a heal is
pending), irrespective of (offset, length).
Ravi
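To illustrate the point (a hypothetical demonstration; /mnt/c_glusterfs
is an assumed mount path, and it presumes one brick is down while the
write happens):

# write a few bytes at an arbitrary offset through the mount
dd if=/dev/urandom bs=1 count=10 seek=4096 conv=notrunc \
   of=/mnt/c_glusterfs/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
# on the surviving brick, the data-pending counter in trusted.afr.*
# increments the same way regardless of the offset written
getfattr -d -m . -e hex \
  /opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml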
ABHISHEK PALIWAL
2016-03-14 09:33:07 UTC
Permalink
Then how can I resolve this issue?
--
Regards
Abhishek Paliwal