Brian Andrus
2018-06-27 14:19:25 UTC
All,
I have a gluster filesystem (glusterfs-4.0.2-1, Type:
Distributed-Replicate, Number of Bricks: 5 x 3 = 15)
I have one directory that is used for Slurm state files, which seems to
get out of sync fairly often. There are particular files that end up
never healing.
Since the files are ephemeral, I'm ok with losing them (for now).
Following some advice, I deleted the UUID-named files that were in
/GLUSTER/brick1/.glusterfs/indices/xattrop/
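For reference, what I removed was roughly the following (this is an
approximation of what I ran, not the exact invocation; the glob just
matches the gfid/UUID-named entries, and I only touched this one brick
path):

    # remove the gfid/UUID-named index entries under the xattrop directory
    find /GLUSTER/brick1/.glusterfs/indices/xattrop/ -maxdepth 1 -type f \
        -name '????????-????-????-????-????????????' -print -delete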
After that, 'gluster volume heal GDATA statistics heal-count' shows no
pending entries, but the problem is still there. Even though nothing shows
up in 'gluster volume heal GDATA info', there are some files/directories
that return "Transport endpoint is not connected" whenever I try to
access them.
There is even a directory that is empty, but if I try to 'rmdir' it, I
get "rmdir: failed to remove '/DATA/slurmstate.old/slurm/': Software
caused connection abort" and the mount goes bad; I have to umount/mount
it to get it back.
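Recovering the mount afterwards looks like this (the server hostname
below is a stand-in for whatever this client normally mounts from):

    umount /DATA                                    # drop the dead fuse mount
    mount -t glusterfs gluster-host:/GDATA /DATA    # gluster-host is a stand-in for the real server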
The attached log file contains a bit of information about the crash.
How do I clean this up? And what is the 'proper' way to handle a file
that will not heal, even in a 3-way replica?
Brian Andrus