[Gluster-users] Problems since 3.12.7: invisible files, strange rebalance size, setxattr failed during rebalance and broken unix rights

Discussion:

Frank Ruehlemann

2018-04-23 13:22:35 UTC

Hi,

after 2 years of running GlusterFS without major problems, we're facing
some strange errors lately.

After updating to 3.12.7, some users reported at least 4 broken
directories with some invisible files. The files exist on the bricks and
don't start with a dot, but aren't visible in "ls". Clients can still
interact with them by using the explicit path.
More information: https://bugzilla.redhat.com/show_bug.cgi?id=1564071

Also since this update, "gluster volume rebalance $myvolume status" has
reported >16900 PB (petabytes!) of rebalanced data for one of our 2
servers. The run time looks right, but the size of the transferred files
is absurd. That rebalance ran with 3.12.6 in March 2018. Its log file
listed no errors and a realistic size at the end.

We started a new rebalance today during a downtime of our corresponding
compute cluster, since these errors started to spread and a rebalance
might help. The output of "gluster volume rebalance $myvolume status"
doesn't list any errors so far and the numbers look realistic.
But every few minutes we're seeing strange errors reported in journald:
"[2018-04-23 12:31:24.942377] E [MSGID: 113001]
[posix.c:5983:_posix_handle_xattr_keyvalue_pair] 0-$myvolume-posix:
setxattr failed
on /srv/glusterfs/bricks/DATA112/data/.glusterfs/e6/a8/e6a8ce50-fda5-4bad-8d4d-acd25dafcaa2 while doing xattrop: key=trusted.glusterfs.quota.1ce02d3b-b7ae-4485-903c-2991de5350b6.contri.1 [No such file or directory]"
The rebalance log file lists no errors.

Has anybody seen similar error messages during a rebalance?

We also see some duplicated files. There are two copies on different
bricks (we're running a distributed volume).
One copy looks like this:
$ ls -lah
-rwxr--r-- 2 $user $group 293 May 11 2017 config

The other one looks rather strange:
$ ls -lah
---------T 2 root $group 0 May 11 2017 config

Has anybody seen similar broken files?

We're using gluster 3.12 from the gluster.org repositories on a standard
Debian 9 with XFS-formatted bricks.

Hopefully somebody has an idea how to fix this.

At least somebody in the future might find this, since we didn't find
anything while searching for these errors. If you're from the future:
Good luck! (^_^)

So far,
--
Frank Rühlemann
IT-Systemtechnik

UNIVERSITÄT ZU LÜBECK
IT-Service-Center

Ratzeburger Allee 160
23562 Lübeck
Tel +49 451 3101 2034
Fax +49 451 3101 2004
***@itsc.uni-luebeck.de
www.itsc.uni-luebeck.de

Nithya Balachandran

2018-04-23 13:42:44 UTC

Hi,

What is the output of 'gluster volume info' for this volume?

Regards,
Nithya


Frank Ruehlemann

2018-04-23 14:06:05 UTC

Hi,

here it is.

# gluster volume info $myvolume

Volume Name: $myvolume
Type: Distribute
Volume ID: 0d210c70-e44f-46f1-862c-ef260514c9f1
Status: Started
Snapshot Count: 0
Number of Bricks: 23
Transport-type: tcp
Bricks:
Brick1: gluster02:/srv/glusterfs/bricks/DATA201/data
Brick2: gluster02:/srv/glusterfs/bricks/DATA202/data
Brick3: gluster02:/srv/glusterfs/bricks/DATA203/data
Brick4: gluster02:/srv/glusterfs/bricks/DATA204/data
Brick5: gluster02:/srv/glusterfs/bricks/DATA205/data
Brick6: gluster02:/srv/glusterfs/bricks/DATA206/data
Brick7: gluster02:/srv/glusterfs/bricks/DATA207/data
Brick8: gluster02:/srv/glusterfs/bricks/DATA208/data
Brick9: gluster01:/srv/glusterfs/bricks/DATA110/data
Brick10: gluster01:/srv/glusterfs/bricks/DATA111/data
Brick11: gluster01:/srv/glusterfs/bricks/DATA112/data
Brick12: gluster01:/srv/glusterfs/bricks/DATA113/data
Brick13: gluster01:/srv/glusterfs/bricks/DATA114/data
Brick14: gluster02:/srv/glusterfs/bricks/DATA209/data
Brick15: gluster01:/srv/glusterfs/bricks/DATA101/data
Brick16: gluster01:/srv/glusterfs/bricks/DATA102/data
Brick17: gluster01:/srv/glusterfs/bricks/DATA103/data
Brick18: gluster01:/srv/glusterfs/bricks/DATA104/data
Brick19: gluster01:/srv/glusterfs/bricks/DATA105/data
Brick20: gluster01:/srv/glusterfs/bricks/DATA106/data
Brick21: gluster01:/srv/glusterfs/bricks/DATA107/data
Brick22: gluster01:/srv/glusterfs/bricks/DATA108/data
Brick23: gluster01:/srv/glusterfs/bricks/DATA109/data
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
auth.allow: $myipspace
performance.readdir-ahead: on
diagnostics.brick-log-level: WARNING
nfs.disable: on
transport.address-family: inet
nfs.addr-namelookup: off
diagnostics.brick-sys-log-level: WARNING
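
A quick way to see how the 23 bricks are spread across the two servers is to count the "BrickN:" lines per host. The following is only a minimal sketch that parses the text output shown above; the function name bricks_per_host and the shortened sample are illustrative and not part of any gluster tooling.

```python
from collections import Counter


def bricks_per_host(volume_info: str) -> Counter:
    """Count bricks per server from `gluster volume info` text output."""
    counts = Counter()
    for line in volume_info.splitlines():
        line = line.strip()
        if not line.startswith("Brick"):
            continue
        # Brick lines look like "Brick11: gluster01:/srv/.../data".
        label, sep, rest = line.partition(": ")
        # Skip the "Bricks:" header, which has no number and no ": " part.
        if not sep or not label[5:].isdigit():
            continue
        host, _, _path = rest.partition(":")
        counts[host] += 1
    return counts


# Shortened sample taken from the output above.
sample = """\
Number of Bricks: 23
Bricks:
Brick1: gluster02:/srv/glusterfs/bricks/DATA201/data
Brick9: gluster01:/srv/glusterfs/bricks/DATA110/data
Brick11: gluster01:/srv/glusterfs/bricks/DATA112/data
"""

print(dict(bricks_per_host(sample)))
```

Fed with the complete output, this gives the per-server brick counts, which helps when comparing per-node numbers in the rebalance status.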

Well, at least one thing got fixed by this reboot: "df -h" returns a
realistic size for the volume again. This wasn't the case after our
update to 3.12.7.

Best Regards,
--
Frank Rühlemann


Nithya Balachandran

2018-04-23 16:21:58 UTC

Hi,

I will continue the analysis for this issue in the bug.

Post by Frank Ruehlemann
And since this update gluster reported for the rebalance of >16900 PB
(Petabyte!) of data for one of our 2 servers, when using "gluster volume
rebalance $myvolume status". The time looks right, but the size of
transferred files is absurd. The rebalance was with 3.12.6 in March 2018.
The last rebalance log file listed no errors and a realistic size at the
end.

This has been seen a few times. It happens because an incorrect value is
stored in the node_state.info file. However, I don't know what causes this
incorrect value to be stored. It is harmless and can be ignored.

Post by Frank Ruehlemann
We started a new rebalance today during a downtime of our corresponding
compute cluster, since these errors started to spread and this might
help. The output of "gluster volume rebalance $myvolume status" doesn't
list any errors so far and the numbers look like realistic values.
But we're seeing some strange errors (every few minutes) reported in the
journald:
"[2018-04-23 12:31:24.942377] E [MSGID: 113001]
setxattr failed
on /srv/glusterfs/bricks/DATA112/data/.glusterfs/e6/a8/
key=trusted.glusterfs.quota.1ce02d3b-b7ae-4485-903c-2991de5350b6.contri.1
[No such file or directory]"
The rebalance log file lists no errors.
Has anybody seen similar error messages during a rebalance?

Are any directories being deleted/renamed during the rebalance? If yes,
this could be a valid message.

Post by Frank Ruehlemann
And we see some files duplicated. There are two copies on different
bricks (we're running a distributed volume).
$ ls -lah
-rwxr--r-- 2 $user $group 293 May 11 2017 config
$ ls -lah
---------T 2 root $group 0 May 11 2017 config
Has anybody seen similar broken files?

This is fine as long as you only see a single file from the mount point.
The 'T' files are internal gluster files (called linkto files) and should
be invisible from the mount point.
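
The listing quoted above already shows the pattern: a zero-byte regular file whose permission bits are just the sticky bit ("---------T"). The following is a minimal sketch of that check; the helper name looks_like_linkto is hypothetical, and mode plus size is only a heuristic. To my knowledge, real linkto files on a brick also carry the trusted.glusterfs.dht.linkto extended attribute, which would be the more reliable confirmation (reading it needs root on the brick).

```python
import stat

# Extended attribute that, to my knowledge, DHT sets on linkto files.
# It is not read here; this sketch only looks at mode and size.
LINKTO_XATTR = "trusted.glusterfs.dht.linkto"


def looks_like_linkto(mode: int, size: int) -> bool:
    """Heuristic: regular file, zero bytes, permission bits sticky-only."""
    return (
        stat.S_ISREG(mode)
        and size == 0
        and stat.S_IMODE(mode) == stat.S_ISVTX
    )


# The two "config" copies from the listing, expressed as st_mode values.
data_copy = stat.S_IFREG | 0o744     # -rwxr--r--, 293 bytes
linkto_copy = stat.S_IFREG | 0o1000  # ---------T, 0 bytes

print(stat.filemode(data_copy), looks_like_linkto(data_copy, 293))
print(stat.filemode(linkto_copy), looks_like_linkto(linkto_copy, 0))
```

On a brick, the same function could be applied to the result of os.lstat(path), passing st_mode and st_size.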

Regards,
Nithya


Frank Ruehlemann

2018-04-24 08:26:48 UTC

Hi,

thank you for your quick answer.

Post by Nithya Balachandran

I will continue the analysis for this issue in the bug.

This would be very helpful. We saw your request for additional
information and will provide it as soon as possible.

Post by Nithya Balachandran

Post by Frank Ruehlemann
And since this update gluster reported for the rebalance of >16900 PB
(Petabyte!) of data for one of our 2 servers, when using „gluster volume
rebalance $myvolume status“. The time looks right, but the size of
transferred files is absurd. The rebalance was with 3.12.6 in March 2018.
The last rebalance log file listed no errors and a realistic size at the
end.

Ok. :)

Post by Nithya Balachandran

Post by Frank Ruehlemann
We started a new rebalance today during a downtime of our corresponding
compute cluster, since these errors started to spread and this might
help. The output of „gluster volume rebalance $myvolume status“ doesn't
list any errors so far and the numbers look like realistic values.
But we're seeing some strange errors (every few minutes) reported in the
journald:
„[2018-04-23 12:31:24.942377] E [MSGID: 113001]
setxattr failed
on /srv/glusterfs/bricks/DATA112/data/.glusterfs/e6/a8/
key=trusted.glusterfs.quota.1ce02d3b-b7ae-4485-903c-2991de5350b6.contri.1
[No such file or directory]“
The rebalance log file lists no errors.
Has anybody seen similar error messages during a rebalance?

Are any directories being deleted/renamed during the rebalance? If yes,
this could be a valid message.

No. We locked out all users and took down all clients that mount the volume before we started the rebalance, to ensure that no client interacts with it.
The messages continued over the last few hours on all bricks of this volume, occurring up to several times per minute, with sporadic quiet phases.
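
To put numbers on "several times per minute", the setxattr failures can be counted per brick and per minute. The following is a rough sketch only: it assumes each log entry is on a single line in the format of the message quoted in this thread, and that brick paths follow the /srv/glusterfs/bricks/<BRICK>/data layout used here. The names LOG_RE, brick_of and count_failures are illustrative.

```python
import re
from collections import Counter

# Matches the MSGID 113001 line quoted in this thread (one entry per line).
LOG_RE = re.compile(
    r"\[(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+\]\s+"
    r"(?P<level>[A-Z])\s+\[MSGID: (?P<msgid>\d+)\]"
    r".*?setxattr failed\s+on\s+(?P<path>\S+)"
)


def brick_of(path: str) -> str:
    """Return the directory name after 'bricks' (assumed site layout)."""
    parts = path.split("/")
    if "bricks" in parts and parts.index("bricks") + 1 < len(parts):
        return parts[parts.index("bricks") + 1]
    return "unknown"


def count_failures(lines) -> Counter:
    """Count setxattr failures keyed by (brick, minute)."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m and m["msgid"] == "113001":
            counts[(brick_of(m["path"]), m["minute"])] += 1
    return counts


# The message from this thread, joined into one line.
sample_line = (
    "[2018-04-23 12:31:24.942377] E [MSGID: 113001] "
    "[posix.c:5983:_posix_handle_xattr_keyvalue_pair] 0-$myvolume-posix: "
    "setxattr failed on /srv/glusterfs/bricks/DATA112/data/.glusterfs/e6/a8/"
    "e6a8ce50-fda5-4bad-8d4d-acd25dafcaa2 while doing xattrop: "
    "key=trusted.glusterfs.quota.1ce02d3b-b7ae-4485-903c-2991de5350b6.contri.1 "
    "[No such file or directory]"
)

print(dict(count_failures([sample_line])))
```

Run over the journald or brick log lines, the resulting counts could be attached to the bug report to show which bricks are affected and when the quiet phases occur.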

Post by Nithya Balachandran

This is fine as long as you only see a single file from the mount point.
The 'T' files are internal gluster files (called linkto files) and should
be invisible from the mount point.
Regards,
Nithya

This is good to know. Yes, all files we saw so far had only one of those
files.

Thanks for your message. It helped a lot.
--
Frank Rühlemann