Discussion:
gfid and volume-id extended attributes lost
Ankireddypalle Reddy
2017-07-07 15:09:39 UTC
Hi,
We faced an issue in production today. We had to stop the volume and reboot all the servers in the cluster. Once the servers rebooted, starting the volume failed because the following extended attributes were missing from all the bricks on 2 of the servers.

1) trusted.gfid

2) trusted.glusterfs.volume-id

We had to manually set these extended attributes to start the volume. Are there any known issues like this?
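
(For reference, this is roughly how we inspected the brick roots; the brick path below is illustrative, not our exact layout:)

    # Dump all xattrs on a brick root, hex-encoded (brick path illustrative):
    getfattr -d -m . -e hex /ws/disk1/ws_brick

    # On a healthy brick root we expect, among others:
    #   trusted.gfid=0x00000000000000000000000000000001   (the fixed root gfid)
    #   trusted.glusterfs.volume-id=0x<volume UUID from "gluster volume info">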

Thanks and Regards,
Ram
Pranith Kumar Karampuri
2017-07-07 15:45:34 UTC
Did anything special happen on these two bricks? It can't happen in the I/O
path:
posix_removexattr() has:
        if (!strcmp (GFID_XATTR_KEY, name)) {
                gf_msg (this->name, GF_LOG_WARNING, 0, P_MSG_XATTR_NOT_REMOVED,
                        "Remove xattr called on gfid for file %s", real_path);
                op_ret = -1;
                goto out;
        }
        if (!strcmp (GF_XATTR_VOL_ID_KEY, name)) {
                gf_msg (this->name, GF_LOG_WARNING, 0, P_MSG_XATTR_NOT_REMOVED,
                        "Remove xattr called on volume-id for file %s",
                        real_path);
                op_ret = -1;
                goto out;
        }

I just found that op_errno is not set correctly, but it can't happen in the
I/O path, so self-heal/rebalance are off the hook.

I also grepped for any removexattr of trusted.gfid from glusterd and didn't
find any.

One thing that used to happen is that after a reboot the brick mounts sometimes would not come up; the brick path is then just the empty mount-point directory, which carries neither trusted.gfid nor volume-id. At the moment that is my wild guess.
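
A quick way to rule that in or out on the affected nodes (a sketch; the brick path is illustrative):

    # If the brick filesystem is mounted, it shows up in /proc/mounts;
    # if not, the brick path is just an empty directory on the root fs.
    grep /ws/disk1 /proc/mounts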
--
Pranith
Pranith Kumar Karampuri
2017-07-07 15:47:25 UTC
The fix for that was to mount the bricks. But considering that you set the xattrs instead, I am guessing the rest of the data was intact and only these particular xattrs were missing? I wonder what new problem this is.
--
Pranith
Ankireddypalle Reddy
2017-07-07 15:50:54 UTC
Pranith,
Thanks for looking into the issue. The bricks were mounted after the reboot. One more thing I noticed: when the attributes were set manually while glusterd was up, they were lost again as soon as the volume was started. We had to stop glusterd, set the attributes, and then start glusterd. After that the volume start succeeded.
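
(Roughly, the sequence that finally worked; a sketch only, since the exact service commands depend on the distro, and the brick path below is just one of ours as an example:)

    # Setting the xattrs while glusterd was running did not stick,
    # so stop it first (service management varies by distro).
    systemctl stop glusterd

    # volume-id is the volume UUID from "gluster volume info", as hex;
    # trusted.gfid on a brick root is the fixed root gfid.
    setfattr -n trusted.glusterfs.volume-id \
             -v 0x149e976f4e21451cbf0ff5691208531f /ws/disk1/ws_brick
    setfattr -n trusted.gfid \
             -v 0x00000000000000000000000000000001 /ws/disk1/ws_brick

    systemctl start glusterd
    gluster volume start StoragePool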

Thanks and Regards,
Ram

Pranith Kumar Karampuri
2017-07-07 15:53:49 UTC
Which version is this?
--
Pranith
Ankireddypalle Reddy
2017-07-07 15:55:31 UTC
3.7.19

Thanks and Regards,
Ram
Pranith Kumar Karampuri
2017-07-07 16:14:47 UTC
These are the only callers of removexattr, and only _posix_remove_xattr has the potential to remove these keys, since posix_removexattr already makes sure the key is not gfid/volume-id. And, surprise surprise, _posix_remove_xattr is invoked only from the healing code of afr/ec. That can only happen if the source brick doesn't have the gfid, which doesn't seem to match the situation you described.

# line  filename / context / line
1  1234  xlators/mgmt/glusterd/src/glusterd-quota.c <<glusterd_remove_quota_limit>>
             ret = sys_lremovexattr (abspath, QUOTA_LIMIT_KEY);
2  1243  xlators/mgmt/glusterd/src/glusterd-quota.c <<glusterd_remove_quota_limit>>
             ret = sys_lremovexattr (abspath, QUOTA_LIMIT_OBJECTS_KEY);
3  6102  xlators/mgmt/glusterd/src/glusterd-utils.c <<glusterd_check_and_set_brick_xattr>>
             sys_lremovexattr (path, "trusted.glusterfs.test");
4    80  xlators/storage/posix/src/posix-handle.h <<REMOVE_PGFID_XATTR>>
             op_ret = sys_lremovexattr (path, key); \
5  5026  xlators/storage/posix/src/posix.c <<_posix_remove_xattr>>
             op_ret = sys_lremovexattr (filler->real_path, key);
6  5101  xlators/storage/posix/src/posix.c <<posix_removexattr>>
             op_ret = sys_lremovexattr (real_path, name);
7  6811  xlators/storage/posix/src/posix.c <<init>>
             sys_lremovexattr (dir_data->data, "trusted.glusterfs.test");

So there are only two possibilities:
1) The source directory in ec/afr doesn't have a gfid.
2) Something else removed these xattrs.

What is your volume info? Maybe that will give more clues.

PS: sys_fremovexattr is called only from posix_fremovexattr(), so that doesn't seem to be the culprit either, as it also has checks to guard against gfid/volume-id removal.
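
(For completeness, the list above is simply what a source grep turns up, e.g.:)

    # From a glusterfs source checkout:
    grep -rn "sys_lremovexattr" xlators/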
--
Pranith
Ankireddypalle Reddy
2017-07-07 21:28:36 UTC
We lost the attributes on all the bricks on servers glusterfs2 and glusterfs3 again.

[***@glusterfs2 Log_Files]# gluster volume info

Volume Name: StoragePool
Type: Distributed-Disperse
Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f
Status: Started
Number of Bricks: 20 x (2 + 1) = 60
Transport-type: tcp
Bricks:
Brick1: glusterfs1sds:/ws/disk1/ws_brick
Brick2: glusterfs2sds:/ws/disk1/ws_brick
Brick3: glusterfs3sds:/ws/disk1/ws_brick
Brick4: glusterfs1sds:/ws/disk2/ws_brick
Brick5: glusterfs2sds:/ws/disk2/ws_brick
Brick6: glusterfs3sds:/ws/disk2/ws_brick
Brick7: glusterfs1sds:/ws/disk3/ws_brick
Brick8: glusterfs2sds:/ws/disk3/ws_brick
Brick9: glusterfs3sds:/ws/disk3/ws_brick
Brick10: glusterfs1sds:/ws/disk4/ws_brick
Brick11: glusterfs2sds:/ws/disk4/ws_brick
Brick12: glusterfs3sds:/ws/disk4/ws_brick
Brick13: glusterfs1sds:/ws/disk5/ws_brick
Brick14: glusterfs2sds:/ws/disk5/ws_brick
Brick15: glusterfs3sds:/ws/disk5/ws_brick
Brick16: glusterfs1sds:/ws/disk6/ws_brick
Brick17: glusterfs2sds:/ws/disk6/ws_brick
Brick18: glusterfs3sds:/ws/disk6/ws_brick
Brick19: glusterfs1sds:/ws/disk7/ws_brick
Brick20: glusterfs2sds:/ws/disk7/ws_brick
Brick21: glusterfs3sds:/ws/disk7/ws_brick
Brick22: glusterfs1sds:/ws/disk8/ws_brick
Brick23: glusterfs2sds:/ws/disk8/ws_brick
Brick24: glusterfs3sds:/ws/disk8/ws_brick
Brick25: glusterfs4sds.commvault.com:/ws/disk1/ws_brick
Brick26: glusterfs5sds.commvault.com:/ws/disk1/ws_brick
Brick27: glusterfs6sds.commvault.com:/ws/disk1/ws_brick
Brick28: glusterfs4sds.commvault.com:/ws/disk10/ws_brick
Brick29: glusterfs5sds.commvault.com:/ws/disk10/ws_brick
Brick30: glusterfs6sds.commvault.com:/ws/disk10/ws_brick
Brick31: glusterfs4sds.commvault.com:/ws/disk11/ws_brick
Brick32: glusterfs5sds.commvault.com:/ws/disk11/ws_brick
Brick33: glusterfs6sds.commvault.com:/ws/disk11/ws_brick
Brick34: glusterfs4sds.commvault.com:/ws/disk12/ws_brick
Brick35: glusterfs5sds.commvault.com:/ws/disk12/ws_brick
Brick36: glusterfs6sds.commvault.com:/ws/disk12/ws_brick
Brick37: glusterfs4sds.commvault.com:/ws/disk2/ws_brick
Brick38: glusterfs5sds.commvault.com:/ws/disk2/ws_brick
Brick39: glusterfs6sds.commvault.com:/ws/disk2/ws_brick
Brick40: glusterfs4sds.commvault.com:/ws/disk3/ws_brick
Brick41: glusterfs5sds.commvault.com:/ws/disk3/ws_brick
Brick42: glusterfs6sds.commvault.com:/ws/disk3/ws_brick
Brick43: glusterfs4sds.commvault.com:/ws/disk4/ws_brick
Brick44: glusterfs5sds.commvault.com:/ws/disk4/ws_brick
Brick45: glusterfs6sds.commvault.com:/ws/disk4/ws_brick
Brick46: glusterfs4sds.commvault.com:/ws/disk5/ws_brick
Brick47: glusterfs5sds.commvault.com:/ws/disk5/ws_brick
Brick48: glusterfs6sds.commvault.com:/ws/disk5/ws_brick
Brick49: glusterfs4sds.commvault.com:/ws/disk6/ws_brick
Brick50: glusterfs5sds.commvault.com:/ws/disk6/ws_brick
Brick51: glusterfs6sds.commvault.com:/ws/disk6/ws_brick
Brick52: glusterfs4sds.commvault.com:/ws/disk7/ws_brick
Brick53: glusterfs5sds.commvault.com:/ws/disk7/ws_brick
Brick54: glusterfs6sds.commvault.com:/ws/disk7/ws_brick
Brick55: glusterfs4sds.commvault.com:/ws/disk8/ws_brick
Brick56: glusterfs5sds.commvault.com:/ws/disk8/ws_brick
Brick57: glusterfs6sds.commvault.com:/ws/disk8/ws_brick
Brick58: glusterfs4sds.commvault.com:/ws/disk9/ws_brick
Brick59: glusterfs5sds.commvault.com:/ws/disk9/ws_brick
Brick60: glusterfs6sds.commvault.com:/ws/disk9/ws_brick
Options Reconfigured:
performance.readdir-ahead: on
diagnostics.client-log-level: INFO
auth.allow: glusterfs1sds,glusterfs2sds,glusterfs3sds,glusterfs4sds.commvault.com,glusterfs5sds.commvault.com,glusterfs6sds.commvault.com

Thanks and Regards,
Ram
Vijay Bellur
2017-07-07 23:33:44 UTC
Do you observe any event pattern (self-healing / disk failures / reboots
etc.) after which the extended attributes are missing?

Regards,
Vijay
Pranith Kumar Karampuri
2017-07-08 01:36:08 UTC
Ram,
As per the code, self-heal was the only candidate which *can* do it.
Could you check the logs of the self-heal daemon and the mount to see whether there are any metadata heals on root?
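
Something along these lines should surface them (a sketch; the default log path is shown, and the exact message text varies between afr/ec and across versions):

    # Look for metadata self-heals mentioning the root gfid in the
    # self-heal daemon log (default location; message text may differ):
    grep -i "metadata" /var/log/glusterfs/glustershd.log | \
        grep -i 00000000-0000-0000-0000-000000000001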


+Sanoj

Sanoj,
Is there any systemtap script we can use to detect which process is
removing these xattrs?
--
Pranith
Sanoj Unnikrishnan
2017-07-10 09:26:51 UTC
@pranith, yes. We can get the pid on every removexattr call and also print the backtrace of the glusterfsd process when the xattr removal is triggered. I will write the script and reply back.
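
Roughly along these lines (an untested sketch; the tapset context variable names may differ between SystemTap versions):

    # Report every removexattr of gfid/volume-id with pid, process name and a
    # user-space backtrace. Untested sketch; "name_str" and "path" are the
    # usual syscall tapset variables but may vary on your SystemTap version.
    stap -d /usr/sbin/glusterfsd --ldd -e '
    probe syscall.removexattr, syscall.lremovexattr {
        if (isinstr(name_str, "trusted.gfid") ||
            isinstr(name_str, "trusted.glusterfs.volume-id")) {
            printf("%s[%d] removexattr %s on %s\n",
                   execname(), pid(), name_str, path)
            print_ubacktrace()
        }
    }'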
Post by Pranith Kumar Karampuri
Ram,
As per the code, self-heal was the only candidate which *can* do
it. Could you check logs of self-heal daemon and the mount to check if
there are any metadata heals on root?
+Sanoj
Sanoj,
Is there any systemtap script we can use to detect which process is
removing these xattrs?
Post by Ankireddypalle Reddy
We lost the attributes on all the bricks on servers glusterfs2 and glusterfs3 again.
Volume Name: StoragePool
Type: Distributed-Disperse
Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f
Status: Started
Number of Bricks: 20 x (2 + 1) = 60
Transport-type: tcp
Brick1: glusterfs1sds:/ws/disk1/ws_brick
Brick2: glusterfs2sds:/ws/disk1/ws_brick
Brick3: glusterfs3sds:/ws/disk1/ws_brick
Brick4: glusterfs1sds:/ws/disk2/ws_brick
Brick5: glusterfs2sds:/ws/disk2/ws_brick
Brick6: glusterfs3sds:/ws/disk2/ws_brick
Brick7: glusterfs1sds:/ws/disk3/ws_brick
Brick8: glusterfs2sds:/ws/disk3/ws_brick
Brick9: glusterfs3sds:/ws/disk3/ws_brick
Brick10: glusterfs1sds:/ws/disk4/ws_brick
Brick11: glusterfs2sds:/ws/disk4/ws_brick
Brick12: glusterfs3sds:/ws/disk4/ws_brick
Brick13: glusterfs1sds:/ws/disk5/ws_brick
Brick14: glusterfs2sds:/ws/disk5/ws_brick
Brick15: glusterfs3sds:/ws/disk5/ws_brick
Brick16: glusterfs1sds:/ws/disk6/ws_brick
Brick17: glusterfs2sds:/ws/disk6/ws_brick
Brick18: glusterfs3sds:/ws/disk6/ws_brick
Brick19: glusterfs1sds:/ws/disk7/ws_brick
Brick20: glusterfs2sds:/ws/disk7/ws_brick
Brick21: glusterfs3sds:/ws/disk7/ws_brick
Brick22: glusterfs1sds:/ws/disk8/ws_brick
Brick23: glusterfs2sds:/ws/disk8/ws_brick
Brick24: glusterfs3sds:/ws/disk8/ws_brick
Brick25: glusterfs4sds.commvault.com:/ws/disk1/ws_brick
Brick26: glusterfs5sds.commvault.com:/ws/disk1/ws_brick
Brick27: glusterfs6sds.commvault.com:/ws/disk1/ws_brick
Brick28: glusterfs4sds.commvault.com:/ws/disk10/ws_brick
Brick29: glusterfs5sds.commvault.com:/ws/disk10/ws_brick
Brick30: glusterfs6sds.commvault.com:/ws/disk10/ws_brick
Brick31: glusterfs4sds.commvault.com:/ws/disk11/ws_brick
Brick32: glusterfs5sds.commvault.com:/ws/disk11/ws_brick
Brick33: glusterfs6sds.commvault.com:/ws/disk11/ws_brick
Brick34: glusterfs4sds.commvault.com:/ws/disk12/ws_brick
Brick35: glusterfs5sds.commvault.com:/ws/disk12/ws_brick
Brick36: glusterfs6sds.commvault.com:/ws/disk12/ws_brick
Brick37: glusterfs4sds.commvault.com:/ws/disk2/ws_brick
Brick38: glusterfs5sds.commvault.com:/ws/disk2/ws_brick
Brick39: glusterfs6sds.commvault.com:/ws/disk2/ws_brick
Brick40: glusterfs4sds.commvault.com:/ws/disk3/ws_brick
Brick41: glusterfs5sds.commvault.com:/ws/disk3/ws_brick
Brick42: glusterfs6sds.commvault.com:/ws/disk3/ws_brick
Brick43: glusterfs4sds.commvault.com:/ws/disk4/ws_brick
Brick44: glusterfs5sds.commvault.com:/ws/disk4/ws_brick
Brick45: glusterfs6sds.commvault.com:/ws/disk4/ws_brick
Brick46: glusterfs4sds.commvault.com:/ws/disk5/ws_brick
Brick47: glusterfs5sds.commvault.com:/ws/disk5/ws_brick
Brick48: glusterfs6sds.commvault.com:/ws/disk5/ws_brick
Brick49: glusterfs4sds.commvault.com:/ws/disk6/ws_brick
Brick50: glusterfs5sds.commvault.com:/ws/disk6/ws_brick
Brick51: glusterfs6sds.commvault.com:/ws/disk6/ws_brick
Brick52: glusterfs4sds.commvault.com:/ws/disk7/ws_brick
Brick53: glusterfs5sds.commvault.com:/ws/disk7/ws_brick
Brick54: glusterfs6sds.commvault.com:/ws/disk7/ws_brick
Brick55: glusterfs4sds.commvault.com:/ws/disk8/ws_brick
Brick56: glusterfs5sds.commvault.com:/ws/disk8/ws_brick
Brick57: glusterfs6sds.commvault.com:/ws/disk8/ws_brick
Brick58: glusterfs4sds.commvault.com:/ws/disk9/ws_brick
Brick59: glusterfs5sds.commvault.com:/ws/disk9/ws_brick
Brick60: glusterfs6sds.commvault.com:/ws/disk9/ws_brick
performance.readdir-ahead: on
diagnostics.client-log-level: INFO
auth.allow: glusterfs1sds,glusterfs2sds,glusterfs3sds,glusterfs4sds.comm
vault.com,glusterfs5sds.commvault.com,glusterfs6sds.commvault.com
Thanks and Regards,
Ram
*Sent:* Friday, July 07, 2017 12:15 PM
*To:* Ankireddypalle Reddy
*Subject:* Re: [Gluster-devel] gfid and volume-id extended attributes lost
On Fri, Jul 7, 2017 at 9:25 PM, Ankireddypalle Reddy <
3.7.19
These are the only callers for removexattr and only _posix_remove_xattr
has the potential to do removexattr as posix_removexattr already makes sure
that it is not gfid/volume-id. And surprise surprise _posix_remove_xattr
happens only from healing code of afr/ec. And this can only happen if the
source brick doesn't have gfid, which doesn't seem to match with the
situation you explained.
# line filename / context / line
1 1234 xlators/mgmt/glusterd/src/glusterd-quota.c
<<glusterd_remove_quota_limit>>
ret = sys_lremovexattr (abspath, QUOTA_LIMIT_KEY);
2 1243 xlators/mgmt/glusterd/src/glusterd-quota.c
<<glusterd_remove_quota_limit>>
ret = sys_lremovexattr (abspath, QUOTA_LIMIT_OBJECTS_KEY);
3 6102 xlators/mgmt/glusterd/src/glusterd-utils.c
<<glusterd_check_and_set_brick_xattr>>
sys_lremovexattr (path, "trusted.glusterfs.test");
4 80 xlators/storage/posix/src/posix-handle.h
<<REMOVE_PGFID_XATTR>>
op_ret = sys_lremovexattr (path, key); \
5 5026 xlators/storage/posix/src/posix.c <<_posix_remove_xattr>>
op_ret = sys_lremovexattr (filler->real_path, key);
6 5101 xlators/storage/posix/src/posix.c <<posix_removexattr>>
op_ret = sys_lremovexattr (real_path, name);
7 6811 xlators/storage/posix/src/posix.c <<init>>
sys_lremovexattr (dir_data->data, "trusted.glusterfs.test");
1) Source directory in ec/afr doesn't have gfid
2) Something else removed these xattrs.
What is your volume info? May be that will give more clues.
PS: sys_fremovexattr is called only from posix_fremovexattr(), so that
doesn't seem to be the culprit as it also have checks to guard against
gfid/volume-id removal.
Thanks and Regards,
Ram
*Sent:* Friday, July 07, 2017 11:54 AM
*To:* Ankireddypalle Reddy
*Subject:* Re: [Gluster-devel] gfid and volume-id extended attributes lost
On Fri, Jul 7, 2017 at 9:20 PM, Ankireddypalle Reddy <
Pranith,
Thanks for looking into the issue. The bricks were
mounted after the reboot. One more thing that I noticed: when the
attributes were manually set while glusterd was up, then on starting the
volume the attributes were again lost. We had to stop glusterd, set the
attributes, and then start glusterd. After that the volume start succeeded.
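For the record, the manual recovery described here boils down to
re-creating the two xattrs on the affected brick roots; a sketch, with
the paths and volume ID taken from later in this thread (adjust for your
bricks, and note the point above that glusterd must be stopped first):

  systemctl stop glusterd
  # the root directory of a brick always carries the fixed root gfid
  setfattr -n trusted.gfid -v 0x00000000000000000000000000000001 /ws/disk2/ws_brick
  # trusted.glusterfs.volume-id is the volume UUID with the dashes removed
  # (here 149e976f-4e21-451c-bf0f-f5691208531f)
  setfattr -n trusted.glusterfs.volume-id -v 0x149e976f4e21451cbf0ff5691208531f /ws/disk2/ws_brick
  systemctl start glusterd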
Which version is this?
Thanks and Regards,
Ram
*Sent:* Friday, July 07, 2017 11:46 AM
*To:* Ankireddypalle Reddy
*Subject:* Re: [Gluster-devel] gfid and volume-id extended attributes lost
posix_removexattr() has:

        if (!strcmp (GFID_XATTR_KEY, name)) {
                gf_msg (this->name, GF_LOG_WARNING, 0, P_MSG_XATTR_NOT_REMOVED,
                        "Remove xattr called on gfid for file %s", real_path);
                op_ret = -1;
                goto out;
        }
        if (!strcmp (GF_XATTR_VOL_ID_KEY, name)) {
                gf_msg (this->name, GF_LOG_WARNING, 0, P_MSG_XATTR_NOT_REMOVED,
                        "Remove xattr called on volume-id for file %s",
                        real_path);
                op_ret = -1;
                goto out;
        }
I just found that op_errno is not set correctly, but it can't happen in
the I/O path, so self-heal/rebalance are off the hook.
I also grepped for any removexattr of trusted.gfid from glusterd and didn't find any.
So one thing that used to happen was that sometimes when machines reboot,
the brick mounts wouldn't happen and this would lead to absence of both
trusted.gfid and volume-id. So at the moment this is my wild guess.
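That guess is cheap to verify: if a brick directory is sitting on the
root filesystem instead of its own mount, it will look "wiped". A small
sketch, assuming the /ws/diskN layout from this thread:

  # each /ws/diskN should be its own filesystem; a brick under an
  # unmounted /ws/diskN would be missing both xattrs
  for m in /ws/disk*; do
      mountpoint -q "$m" || echo "$m is NOT a mountpoint"
  done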
--
Pranith
Sanoj Unnikrishnan
2017-07-10 11:49:22 UTC
Reply
Permalink
Raw Message
Please use the systemtap script (https://paste.fedoraproject.org/paste/EGDa0ErwX0LV3y-gBYpfNA)
to check which process is invoking removexattr calls.
It prints the pid, tid and arguments of all removexattr calls.
I have checked for these fops at the protocol/client and posix translators.

To run the script:
1) install systemtap and dependencies.
2) install glusterfs-debuginfo
3) change the path of the translator in the systemtap script to appropriate
values for your system
(change "/usr/lib64/glusterfs/3.12dev/xlator/protocol/client.so" and
"/usr/lib64/glusterfs/3.12dev/xlator/storage/posix.so")
4) run the script as follows
#stap -v fop_trace.stp

The output would look like this; additionally, arguments will also be
dumped if glusterfs-debuginfo is installed (I had not done that here):
pid-958: 0 glusterfsd(3893):->posix_setxattr
pid-958: 47 glusterfsd(3893):<-posix_setxattr
pid-966: 0 glusterfsd(5033):->posix_setxattr
pid-966: 57 glusterfsd(5033):<-posix_setxattr
pid-1423: 0 glusterfs(1431):->client_setxattr
pid-1423: 37 glusterfs(1431):<-client_setxattr
pid-1423: 0 glusterfs(1431):->client_setxattr
pid-1423: 41 glusterfs(1431):<-client_setxattr
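Once the trace is running, the lines of interest for this bug can be
filtered out directly, for example:

  # gfid/volume-id should never legitimately be removed, so any
  # removexattr hit here is worth a close look
  stap -v fop_trace.stp | grep -i removexattr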

Regards,
Sanoj
Pranith Kumar Karampuri
2017-07-10 12:30:50 UTC
Reply
Permalink
Raw Message
Ram,
If you see it again, you can use this. I am going to send out a patch
for the code path which can lead to removal of gfid/volume-id tomorrow.
--
Pranith
Ankireddypalle Reddy
2017-07-10 13:00:58 UTC
Reply
Permalink
Raw Message
Thanks for the swift turn around. Will try this out and let you know.

Thanks and Regards,
Ram
Pranith Kumar Karampuri
2017-07-13 08:13:28 UTC
Reply
Permalink
Raw Message
Ram,
I sent https://review.gluster.org/17765 to fix the possibility in
bulk removexattr. But I am not sure if this is indeed the reason for this
issue.
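For anyone who wants to test it before it is merged, a change on
review.gluster.org can be fetched via gerrit's refs/changes scheme; a
sketch, assuming patchset 1 of change 17765:

  git clone https://review.gluster.org/glusterfs
  cd glusterfs
  git fetch https://review.gluster.org/glusterfs refs/changes/65/17765/1
  git cherry-pick FETCH_HEAD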
--
Pranith
Ankireddypalle Reddy
2017-07-13 13:55:59 UTC
Reply
Permalink
Raw Message
Thanks Pranith. We are waiting for a downtime on our production setup and
will update you once we are able to apply the patch.

Thanks and Regards,
Ram
From: Pranith Kumar Karampuri [mailto:***@redhat.com]
Sent: Thursday, July 13, 2017 4:13 AM
To: Ankireddypalle Reddy
Cc: Sanoj Unnikrishnan; Gluster Devel (gluster-***@gluster.org); gluster-***@gluster.org
Subject: Re: [Gluster-devel] gfid and volume-id extended attributes lost

Ram,
I sent https://review.gluster.org/17765 to fix the possibility in bulk removexattr. But I am not sure if this is indeed the reason for this issue.

On Mon, Jul 10, 2017 at 6:30 PM, Ankireddypalle Reddy <***@commvault.com<mailto:***@commvault.com>> wrote:
Thanks for the swift turn around. Will try this out and let you know.

Thanks and Regards,
Ram
From: Pranith Kumar Karampuri [mailto:***@redhat.com<mailto:***@redhat.com>]
Sent: Monday, July 10, 2017 8:31 AM
To: Sanoj Unnikrishnan
Cc: Ankireddypalle Reddy; Gluster Devel (gluster-***@gluster.org<mailto:gluster-***@gluster.org>); gluster-***@gluster.org<mailto:gluster-***@gluster.org>

Subject: Re: [Gluster-devel] gfid and volume-id extended attributes lost

Ram,
If you see it again, you can use this. I am going to send out a patch for the code path which can lead to removal of gfid/volume-id tomorrow.

On Mon, Jul 10, 2017 at 5:19 PM, Sanoj Unnikrishnan <***@redhat.com<mailto:***@redhat.com>> wrote:
Please use the systemtap script(https://paste.fedoraproject.org/paste/EGDa0ErwX0LV3y-gBYpfNA) to check which process is invoking remove xattr calls.
It prints the pid, tid and arguments of all removexattr calls.
I have checked for these fops at the protocol/client and posix translators.

To run the script ..
1) install systemtap and dependencies.
2) install glusterfs-debuginfo
3) change the path of the translator in the systemtap script to appropriate values for your system
(change "/usr/lib64/glusterfs/3.12dev/xlator/protocol/client.so" and "/usr/lib64/glusterfs/3.12dev/xlator/storage/posix.so")
4) run the script as follows
#stap -v fop_trace.stp

The o/p would look like these .. additionally arguments will also be dumped if glusterfs-debuginfo is also installed (i had not done it here.)
pid-958: 0 glusterfsd(3893):->posix_setxattr
pid-958: 47 glusterfsd(3893):<-posix_setxattr
pid-966: 0 glusterfsd(5033):->posix_setxattr
pid-966: 57 glusterfsd(5033):<-posix_setxattr
pid-1423: 0 glusterfs(1431):->client_setxattr
pid-1423: 37 glusterfs(1431):<-client_setxattr
pid-1423: 0 glusterfs(1431):->client_setxattr
pid-1423: 41 glusterfs(1431):<-client_setxattr
Regards,
Sanoj



On Mon, Jul 10, 2017 at 2:56 PM, Sanoj Unnikrishnan <***@redhat.com<mailto:***@redhat.com>> wrote:
@ pranith , yes . we can get the pid on all removexattr call and also print the backtrace of the glusterfsd process when trigerring removing xattr.
I will write the script and reply back.

On Sat, Jul 8, 2017 at 7:06 AM, Pranith Kumar Karampuri <***@redhat.com<mailto:***@redhat.com>> wrote:
Ram,
As per the code, self-heal was the only candidate which *can* do it. Could you check logs of self-heal daemon and the mount to check if there are any metadata heals on root?
+Sanoj
Sanoj,
Is there any systemtap script we can use to detect which process is removing these xattrs?

On Sat, Jul 8, 2017 at 2:58 AM, Ankireddypalle Reddy <***@commvault.com<mailto:***@commvault.com>> wrote:
We lost the attributes on all the bricks on servers glusterfs2 and glusterfs3 again.

[***@glusterfs2 Log_Files]# gluster volume info

Volume Name: StoragePool
Type: Distributed-Disperse
Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f
Status: Started
Number of Bricks: 20 x (2 + 1) = 60
Transport-type: tcp
Bricks:
Brick1: glusterfs1sds:/ws/disk1/ws_brick
Brick2: glusterfs2sds:/ws/disk1/ws_brick
Brick3: glusterfs3sds:/ws/disk1/ws_brick
Brick4: glusterfs1sds:/ws/disk2/ws_brick
Brick5: glusterfs2sds:/ws/disk2/ws_brick
Brick6: glusterfs3sds:/ws/disk2/ws_brick
Brick7: glusterfs1sds:/ws/disk3/ws_brick
Brick8: glusterfs2sds:/ws/disk3/ws_brick
Brick9: glusterfs3sds:/ws/disk3/ws_brick
Brick10: glusterfs1sds:/ws/disk4/ws_brick
Brick11: glusterfs2sds:/ws/disk4/ws_brick
Brick12: glusterfs3sds:/ws/disk4/ws_brick
Brick13: glusterfs1sds:/ws/disk5/ws_brick
Brick14: glusterfs2sds:/ws/disk5/ws_brick
Brick15: glusterfs3sds:/ws/disk5/ws_brick
Brick16: glusterfs1sds:/ws/disk6/ws_brick
Brick17: glusterfs2sds:/ws/disk6/ws_brick
Brick18: glusterfs3sds:/ws/disk6/ws_brick
Brick19: glusterfs1sds:/ws/disk7/ws_brick
Brick20: glusterfs2sds:/ws/disk7/ws_brick
Brick21: glusterfs3sds:/ws/disk7/ws_brick
Brick22: glusterfs1sds:/ws/disk8/ws_brick
Brick23: glusterfs2sds:/ws/disk8/ws_brick
Brick24: glusterfs3sds:/ws/disk8/ws_brick
Brick25: glusterfs4sds.commvault.com:/ws/disk1/ws_brick
Brick26: glusterfs5sds.commvault.com:/ws/disk1/ws_brick
Brick27: glusterfs6sds.commvault.com:/ws/disk1/ws_brick
Brick28: glusterfs4sds.commvault.com:/ws/disk10/ws_brick
Brick29: glusterfs5sds.commvault.com:/ws/disk10/ws_brick
Brick30: glusterfs6sds.commvault.com:/ws/disk10/ws_brick
Brick31: glusterfs4sds.commvault.com:/ws/disk11/ws_brick
Brick32: glusterfs5sds.commvault.com:/ws/disk11/ws_brick
Brick33: glusterfs6sds.commvault.com:/ws/disk11/ws_brick
Brick34: glusterfs4sds.commvault.com:/ws/disk12/ws_brick
Brick35: glusterfs5sds.commvault.com:/ws/disk12/ws_brick
Brick36: glusterfs6sds.commvault.com:/ws/disk12/ws_brick
Brick37: glusterfs4sds.commvault.com:/ws/disk2/ws_brick
Brick38: glusterfs5sds.commvault.com:/ws/disk2/ws_brick
Brick39: glusterfs6sds.commvault.com:/ws/disk2/ws_brick
Brick40: glusterfs4sds.commvault.com:/ws/disk3/ws_brick
Brick41: glusterfs5sds.commvault.com:/ws/disk3/ws_brick
Brick42: glusterfs6sds.commvault.com:/ws/disk3/ws_brick
Brick43: glusterfs4sds.commvault.com:/ws/disk4/ws_brick
Brick44: glusterfs5sds.commvault.com:/ws/disk4/ws_brick
Brick45: glusterfs6sds.commvault.com:/ws/disk4/ws_brick
Brick46: glusterfs4sds.commvault.com:/ws/disk5/ws_brick
Brick47: glusterfs5sds.commvault.com:/ws/disk5/ws_brick
Brick48: glusterfs6sds.commvault.com:/ws/disk5/ws_brick
Brick49: glusterfs4sds.commvault.com:/ws/disk6/ws_brick
Brick50: glusterfs5sds.commvault.com:/ws/disk6/ws_brick
Brick51: glusterfs6sds.commvault.com:/ws/disk6/ws_brick
Brick52: glusterfs4sds.commvault.com:/ws/disk7/ws_brick
Brick53: glusterfs5sds.commvault.com:/ws/disk7/ws_brick
Brick54: glusterfs6sds.commvault.com:/ws/disk7/ws_brick
Brick55: glusterfs4sds.commvault.com:/ws/disk8/ws_brick
Brick56: glusterfs5sds.commvault.com:/ws/disk8/ws_brick
Brick57: glusterfs6sds.commvault.com:/ws/disk8/ws_brick
Brick58: glusterfs4sds.commvault.com:/ws/disk9/ws_brick
Brick59: glusterfs5sds.commvault.com:/ws/disk9/ws_brick
Brick60: glusterfs6sds.commvault.com:/ws/disk9/ws_brick
Options Reconfigured:
performance.readdir-ahead: on
diagnostics.client-log-level: INFO
auth.allow: glusterfs1sds,glusterfs2sds,glusterfs3sds,glusterfs4sds.commvault.com,glusterfs5sds.commvault.com,glusterfs6sds.commvault.com

Thanks and Regards,
Ram
From: Pranith Kumar Karampuri [mailto:***@redhat.com]
Sent: Friday, July 07, 2017 12:15 PM

To: Ankireddypalle Reddy
Cc: Gluster Devel (gluster-***@gluster.org); gluster-***@gluster.org
Subject: Re: [Gluster-devel] gfid and volume-id extended attributes lost



On Fri, Jul 7, 2017 at 9:25 PM, Ankireddypalle Reddy <***@commvault.com> wrote:
3.7.19

These are the only callers of removexattr, and only _posix_remove_xattr has the potential to do the removexattr, as posix_removexattr already makes sure that it is not gfid/volume-id. And, surprise surprise, _posix_remove_xattr happens only from the healing code of afr/ec. And that can only happen if the source brick doesn't have the gfid, which doesn't seem to match the situation you explained.

# line filename / context / line
1 1234 xlators/mgmt/glusterd/src/glusterd-quota.c <<glusterd_remove_quota_limit>>
ret = sys_lremovexattr (abspath, QUOTA_LIMIT_KEY);
2 1243 xlators/mgmt/glusterd/src/glusterd-quota.c <<glusterd_remove_quota_limit>>
ret = sys_lremovexattr (abspath, QUOTA_LIMIT_OBJECTS_KEY);
3 6102 xlators/mgmt/glusterd/src/glusterd-utils.c <<glusterd_check_and_set_brick_xattr>>
sys_lremovexattr (path, "trusted.glusterfs.test");
4 80 xlators/storage/posix/src/posix-handle.h <<REMOVE_PGFID_XATTR>>
op_ret = sys_lremovexattr (path, key); \
5 5026 xlators/storage/posix/src/posix.c <<_posix_remove_xattr>>
op_ret = sys_lremovexattr (filler->real_path, key);
6 5101 xlators/storage/posix/src/posix.c <<posix_removexattr>>
op_ret = sys_lremovexattr (real_path, name);
7 6811 xlators/storage/posix/src/posix.c <<init>>
sys_lremovexattr (dir_data->data, "trusted.glusterfs.test");
So there are only two possibilities:
1) Source directory in ec/afr doesn't have gfid
2) Something else removed these xattrs.
What is your volume info? Maybe that will give more clues.

PS: sys_fremovexattr is called only from posix_fremovexattr(), so that doesn't seem to be the culprit, as it also has checks to guard against gfid/volume-id removal.

Thanks and Regards,
Ram
From: Pranith Kumar Karampuri [mailto:***@redhat.com]
Sent: Friday, July 07, 2017 11:54 AM

To: Ankireddypalle Reddy
Cc: Gluster Devel (gluster-***@gluster.org); gluster-***@gluster.org
Subject: Re: [Gluster-devel] gfid and volume-id extended attributes lost



On Fri, Jul 7, 2017 at 9:20 PM, Ankireddypalle Reddy <***@commvault.com> wrote:
Pranith,
Thanks for looking into the issue. The bricks were mounted after the reboot. One more thing that I noticed: when the attributes were set manually while glusterd was up, the attributes were lost again on starting the volume. I had to stop glusterd, set the attributes, and then start glusterd. After that the volume start succeeded.
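For reference, a sketch of that manual restore, using the StoragePool volume-id from the volume info above and one brick as an example (the volume-id value is the volume's UUID with the dashes stripped, and the gfid of a brick root is always the root gfid):

# with glusterd stopped on the node; repeat for each affected brick root
setfattr -n trusted.glusterfs.volume-id -v 0x149e976f4e21451cbf0ff5691208531f /ws/disk1/ws_brick
setfattr -n trusted.gfid -v 0x00000000000000000000000000000001 /ws/disk1/ws_brick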

Which version is this?


Thanks and Regards,
Ram

From: Pranith Kumar Karampuri [mailto:***@redhat.com]
Sent: Friday, July 07, 2017 11:46 AM
To: Ankireddypalle Reddy
Cc: Gluster Devel (gluster-***@gluster.org); gluster-***@gluster.org
Subject: Re: [Gluster-devel] gfid and volume-id extended attributes lost

Did anything special happen on these two bricks? It can't happen in the I/O path:
posix_removexattr() has:
        if (!strcmp (GFID_XATTR_KEY, name)) {
                gf_msg (this->name, GF_LOG_WARNING, 0, P_MSG_XATTR_NOT_REMOVED,
                        "Remove xattr called on gfid for file %s", real_path);
                op_ret = -1;
                goto out;
        }
        if (!strcmp (GF_XATTR_VOL_ID_KEY, name)) {
                gf_msg (this->name, GF_LOG_WARNING, 0, P_MSG_XATTR_NOT_REMOVED,
                        "Remove xattr called on volume-id for file %s",
                        real_path);
                op_ret = -1;
                goto out;
        }
I just found that op_errno is not set correctly, but it can't happen in the I/O path, so self-heal/rebalance are off the hook.
I also grepped for any removexattr of trusted.gfid from glusterd and didn't find any.
So one thing that used to happen was that sometimes when machines rebooted, the brick mounts wouldn't happen, and this would lead to the absence of both trusted.gfid and volume-id. So at the moment this is my wild guess.
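A sketch of how to rule that out on the next reboot, using the brick layout from this setup: verify that each brick path is a real mount point and still carries both xattrs before the volume is started.

# for each brick, e.g. /ws/disk1 on every server
mountpoint /ws/disk1
getfattr -d -m . -e hex /ws/disk1/ws_brick | grep -E 'trusted.gfid|trusted.glusterfs.volume-id'

If the mount didn't come up, getfattr runs against the empty mount directory on the root filesystem, and both xattrs will appear to be missing.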


On Fri, Jul 7, 2017 at 8:39 PM, Ankireddypalle Reddy <***@commvault.com> wrote:
Hi,
We faced an issue in the production today. We had to stop the volume and reboot all the servers in the cluster. Once the servers rebooted starting of the volume failed because the following extended attributes were not present on all the bricks on 2 servers.

1) trusted.gfid

2) trusted.glusterfs.volume-id

We had to manually set these extended attributes to start the volume. Are there any such known issues.

Thanks and Regards,
Ram
***************************Legal Disclaimer***************************
"This communication may contain confidential and privileged material for the
sole use of the intended recipient. Any unauthorized review, use or distribution
by others is strictly prohibited. If you have received the message by mistake,
please advise the sender by reply email and delete the message. Thank you."
**********************************************************************

_______________________________________________
Gluster-devel mailing list
Gluster-***@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel
--
Pranith