Update on this issue:
After I deleted the zero-length files which were located on the wrong
bricks, the issue of files losing permissions is mostly resolved. I left
my find running every 10 minutes for the last five days, though, and the
problem continues to recur with a few hundred files every day or two. This
leads me to believe there is some bug in GlusterFS which causes this to
happen.
My script recorded 1232 files which lost their permissions on August 2 and
870 files on August 6. As I noted earlier, these files were created years
ago. One notable fact is that the mtime as reported by GlusterFS is July
28th, which lines up with the rebalance log entries below.
The rebalance log sheds a bit of light on this, but I'm not sure what to
conclude. Here is the log for one of the affected files:
UDS8-rebalance.log.3.gz:[2013-07-28 13:52:08.010359] I
[dht-rebalance.c:1063:gf_defrag_migrate_data] 0-UDS8-dht: migrate data
called on /6f/83/ca/rrrivera25/media
UDS8-rebalance.log.3.gz:[2013-07-28 13:52:08.068909] I
[dht-rebalance.c:647:dht_migrate_file] 0-UDS8-dht:
/6f/83/ca/rrrivera25/media/3301856.jpg: attempting to move from
UDS8-replicate-0 to UDS8-replicate-2
UDS8-rebalance.log.3.gz:[2013-07-28 13:52:08.068949] I
[dht-rebalance.c:647:dht_migrate_file] 0-UDS8-dht:
/6f/83/ca/rrrivera25/media/3301856.jpg: attempting to move from
UDS8-replicate-0 to UDS8-replicate-2
UDS8-rebalance.log.3.gz:[2013-07-28 13:52:08.122885] W
[client3_1-fops.c:1114:client3_1_getxattr_cbk] 0-UDS8-client-0: remote
operation failed: No such file or directory. Path:
/6f/83/ca/rrrivera25/media/3301856.jpg
(00000000-0000-0000-0000-000000000000). Key: (null)
UDS8-rebalance.log.3.gz:[2013-07-28 13:52:08.123274] W
[client3_1-fops.c:1114:client3_1_getxattr_cbk] 0-UDS8-client-1: remote
operation failed: No such file or directory. Path:
/6f/83/ca/rrrivera25/media/3301856.jpg
(00000000-0000-0000-0000-000000000000). Key: (null)
UDS8-rebalance.log.3.gz:[2013-07-28 13:52:08.123330] W
[dht-rebalance.c:739:dht_migrate_file] 0-UDS8-dht:
/6f/83/ca/rrrivera25/media/3301856.jpg: failed to get xattr from
UDS8-replicate-0 (No such file or directory)
UDS8-rebalance.log.3.gz:[2013-07-28 13:52:08.123380] W
[dht-rebalance.c:745:dht_migrate_file] 0-UDS8-dht:
/6f/83/ca/rrrivera25/media/3301856.jpg: failed to set xattr on
UDS8-replicate-2 (Invalid argument)
UDS8-rebalance.log.3.gz:[2013-07-28 13:52:08.134502] I
[dht-rebalance.c:856:dht_migrate_file] 0-UDS8-dht: completed migration of
/6f/83/ca/rrrivera25/media/3301856.jpg from subvolume UDS8-replicate-0 to
UDS8-replicate-2
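For anyone who wants to cross-check their own rebalance logs, something along
these lines should pull out the paths that hit the same xattr failures during
migration. This is only a sketch: the log location and glob match my nodes,
so adjust them for yours.
  # Collect paths whose migration logged a "failed to get/set xattr" warning.
  zgrep -h "failed to .* xattr" /var/log/glusterfs/UDS8-rebalance.log* \
    | sed -n 's/.*0-UDS8-dht: \(.*\): failed to .*/\1/p' \
    | sort -u > /tmp/failed-migrations.txt
  wc -l /tmp/failed-migrations.txt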
Post by Justin Dossey
It sounds like this is related to NFS.
Anand, thank you for the response. I was under the impression that DHT
linkfiles are only in the .glusterfs subdirectory on the brick; in my case,
these files are outside that directory. Furthermore, they aren't named
like DHT linkfiles (using that hash key)-- they are named the same as the
actual files. Finally, once I removed the bad files and their DHT
linkfiles, the issue went away and the files remained accessible. I had to
remove around 100,000 of these bad 000/1000 zero-length files (and their
DHT linkfiles) last night; only 324 additional files were detected.
On my volumes, I use a hashing scheme for regular files that is similar to
the DHT one-- the top level of the volume contains only directories 00
through ff, and so on down the tree. Perhaps that is what caused the
confusion?
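To make that concrete, the layout is shaped roughly like the sketch below.
The hash function and the number of levels here are purely illustrative, not
our real code; the point is that the volume root only ever contains
two-hex-character directories.
  # Illustrative only: derive a /xx/yy/zz/<name> path from a hash of the name.
  user=rrrivera25
  prefix=$(printf '%s' "$user" | md5sum | cut -c1-6)
  path="/${prefix:0:2}/${prefix:2:2}/${prefix:4:2}/${user}"
  echo "$path"   # shaped like the /6f/83/ca/rrrivera25 paths in the logs above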
For transparency, here is the command I run from the client to detect and
fix the affected files:
find ./?? -type f -perm 000 -ls -exec chmod -v 644 {} \; -o -type f -perm
1000 -ls -exec chmod -v 644 {} \;
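The "script" I mentioned recording the daily counts is just a small cron
wrapper around that find, roughly like the following. The mount point,
script name, and log path are specific to my setup and only illustrative.
  #!/bin/bash
  # Run from cron every 10 minutes on a client that has the volume mounted:
  #   */10 * * * * /usr/local/bin/fix-gluster-perms.sh
  cd /mnt/uds8 || exit 1
  find ./?? -type f -perm 000 -ls -exec chmod -v 644 {} \; \
      -o -type f -perm 1000 -ls -exec chmod -v 644 {} \; \
      >> /var/log/fix-gluster-perms.log 2>&1
  # Daily totals come from counting chmod's verbose "changed" lines, e.g.:
  #   grep -c "changed from" /var/log/fix-gluster-perms.log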
If this number does not grow, I will conclude that I just missed these 324
files. If the number gets larger, I can only conclude that GlusterFS is
somehow introducing this corruption. If that is the case, I'll dig some
more.
Maik, I may have experienced the same thing. I used rsync over NFS
without --inplace to load my data into the GlusterFS volume, and I wound up
with all of those bad files on the wrong bricks (i.e. a file that should
exist only on server1-brick1 and server2-brick1 also had "bad" zero-length,
mode-1000 versions on server3-brick1 and server4-brick1, leading to
confusing results on the clients). Since then, I've switched to using the
native client for data loads, and also to passing the --inplace flag to
rsync.
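For reference, the data loads now look roughly like this (the server name
and mount point are placeholders). --inplace matters because without it
rsync writes to a temporary file and then renames it into place, and that
rename appears to be what leaves the stray zero-length files behind:
  # Mount with the native (FUSE) client instead of NFS:
  mount -t glusterfs uds-01:/UDS8 /mnt/uds8
  # --inplace makes rsync update the destination file directly rather than
  # writing a temp file and renaming it over the target.
  rsync -a --inplace /local/source/ /mnt/uds8/destination/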
1. During a large rebalance, one GlusterFS node exceeded its system max
open files limit, and was rebooted. The rebalance did not stop while this
took place.
2. Three times during the same rebalance, the Gluster NFS daemon used an
excessive amount of memory and was killed by the kernel oom-killer. The
system in question has 8 GB of memory, was the rebalance master, and is not
running any significant software besides GlusterFS. Each time, I
restarted glusterfs and the NFS server daemon started serving files again.
The rebalance was not interrupted.
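On the open-files front, for anyone who wants to watch for the same thing,
something like this shows both the system-wide handle usage and the
per-brick-daemon limits. These are standard Linux /proc interfaces, nothing
Gluster-specific; run as root.
  # System-wide allocated handles and the fs.file-max limit:
  cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
  # Per brick daemon: current fd count and the process's own limit:
  for pid in $(pidof glusterfsd); do
      echo "pid $pid: $(ls /proc/$pid/fd | wc -l) open fds"
      grep "Max open files" /proc/$pid/limits
  done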
Post by Maik Kulbe
Hi,
I've just had a problem removing a directory with test files. I had an
inaccessible folder which I could neither delete nor read on the client
(with both the NFS and FUSE clients). On the backend, the folder had
completely zeroed permissions, and the files showed the zeroed permissions
with the sticky bit. I can't remove the folder on the client (it fails with
'directory not empty'), but if I delete the empty files on the backend, it's
gone. Is there any explanation for this?
I also found that this only happens if I remove the folder recursively over
NFS. When I remove the files in the folder first, there are no 0-size files
on the backend and I can delete the directory with rmdir without any
problem.
Justin,
Post by Anand Avati
What you are seeing are internal DHT linkfiles. They are zero-byte files
with mode 01000. Changing their mode forcefully in the backend to something
else WILL render your files inaccessible from the mount point. I am assuming
that you have seen these files only in the backend and not from the mount
point. Accessing or modifying files like this directly from the backend is
very dangerous for your data, as explained in this very example.
Avati
One thing I do see with the issue we're having is that the files which
have lost their permissions have "bad" versions on multiple bricks.
Since the replica count is 2 for any given file, there should be only
two copies of each, no?
For example, the file below has zero-length, zero-permission versions on
uds-06/brick2 and uds-07/brick2, but good versions on uds-05/brick1 and
uds-06/brick1.
FILE is /09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
uds-05 -rw-r--r-- 2 apache apache 2233 Jul 6 2008 /export/brick1/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
uds-06 -rw-r--r-- 2 apache apache 2233 Jul 6 2008 /export/brick1/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
uds-06 ---------T 2 apache apache 0 Jul 23 03:11 /export/brick2/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
uds-07 ---------T 2 apache apache 0 Jul 23 03:11 /export/brick2/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
Is it acceptable for me to just delete the zero-length copies?
Do you know whether it's acceptable to modify permissions on the brick
itself (as opposed to over NFS or via the fuse client)? It seems that
as long as I don't modify the xattrs, the permissions I set on files
on the bricks are passed through.
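Related to the xattr question: when I want to see what a given on-brick copy
actually is, I dump its xattrs directly on the brick (read-only, so it
shouldn't disturb anything). A genuine DHT linkfile carries a
trusted.glusterfs.dht.linkto attribute naming the subvolume that holds the
real data. For example, against one of the copies listed above:
  # Run as root on the brick server, against the brick path (not the mount).
  getfattr -d -m . -e hex \
    /export/brick2/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
  # A real linkfile includes a line like (value shown hex-encoded here):
  #   trusted.glusterfs.dht.linkto=0x...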
I am not seeing exactly that, but I am experiencing the ownership of the
root directory of a gluster volume reverting from a particular user.user to
root.root. I have to periodically do a "cd /share; chown user.user ."
Post by Justin Dossey
Hi all,
I have a relatively-new GlusterFS 3.3.2 4-node cluster in
distributed-replicated mode running in a production environment.
After adding bricks from nodes 3 and 4 (which changed the cluster type from
simple replicated-2 to distributed-replicated-2), I've discovered that files
are randomly losing their permissions. These are files that aren't being
accessed by our clients-- some of them haven't been touched for years.
When I say "losing their permissions", I mean that regular files are going
from 0644 to 0000 or 1000.
Since this is a real production issue, I run a parallel find process to
correct them every ten minutes. It has corrected approximately 40,000 files
in the past 18 hours.
Is anyone else seeing this kind of issue? My searches have turned up nothing
so far.
--
Justin Dossey
CTO, PodOmatic
_______________________________________________
Gluster-users mailing list
http://supercolony.gluster.org/mailman/listinfo/gluster-users