Discussion:
[Gluster-users] Kicking a stuck heal
Dave Sherohman
2018-09-04 10:32:53 UTC
Last Friday, I rebooted one of my gluster nodes and it didn't properly
mount the filesystem holding its brick (I had forgotten to add it to
fstab...), so, when I got back to work on Monday, its root filesystem
was full and the gluster heal info showed around 25000 entries needing
to be healed.
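For anyone who finds this thread with the same problem, the cure is simply a
persistent mount for the brick filesystem. A minimal sketch of an /etc/fstab
entry, where the device, mount point, and filesystem type are all placeholders
for your own layout:

    # keep the brick filesystem mounted across reboots
    # (device, mount point, and fs type below are examples only)
    /dev/mapper/vg0-brick1  /srv/glusterfs/brick1  xfs  defaults,noatime  0  2

With that in place, "mount -a" (or the next reboot) brings the brick back under
its intended filesystem, so brick data no longer lands on the root filesystem
as it did here.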

I got the filesystems straightened out and, within a matter of minutes,
the number of entries waiting to be healed in that subvolume dropped to
59. (Showing twice, of course. The cluster is replica 2+A, so the
other full replica and the arbiter are both showing the same list of
entries.) Over a full day later, it's still at 59.

Is there anything I can do to kick the self-heal back into action and
get those final 59 entries cleaned up?
--
Dave Sherohman
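A minimal sketch of the usual ways to nudge self-heal along, assuming a volume
name of "myvol" (the name is a placeholder, and exact behaviour varies a bit
between gluster releases):

    # show what is still pending
    gluster volume heal myvol info

    # trigger an index heal (only the entries already marked as pending)
    gluster volume heal myvol

    # trigger a full heal (crawls the whole volume; heavier, but catches stragglers)
    gluster volume heal myvol full

    # entries in split-brain will not self-heal and need manual resolution
    gluster volume heal myvol info split-brain

    # "start ... force" restarts any self-heal daemons or bricks that are down
    gluster volume start myvol force

If the count still does not move, the self-heal daemon log
(/var/log/glusterfs/glustershd.log on each node) usually says why an entry is
being skipped.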
Pranith Kumar Karampuri
2018-09-04 10:58:36 UTC
Which version of glusterfs are you using?
Post by Dave Sherohman
Last Friday, I rebooted one of my gluster nodes and it didn't properly
mount the filesystem holding its brick (I had forgotten to add it to
fstab...), so, when I got back to work on Monday, its root filesystem
was full and the gluster heal info showed around 25000 entries needing
to be healed.
I got the filesystems straightened out and, within a matter of minutes,
the number of entries waiting to be healed in that subvolume dropped to
59. (Showing twice, of course. The cluster is replica 2+A, so the
other full replica and the arbiter are both showing the same list of
entries.) Over a full day later, it's still at 59.
Is there anything I can do to kick the self-heal back into action and
get those final 59 entries cleaned up?
--
Dave Sherohman
--
Pranith
Dave Sherohman
2018-09-04 12:35:34 UTC
Post by Dave Sherohman
Is there anything I can do to kick the self-heal back into action and
get those final 59 entries cleaned up?
In response to the request about what version of gluster I'm running
(...which I deleted prematurely...), it's the latest version from the
Debian stable repository, which they identify as 3.8.8-1.
--
Dave Sherohman
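For reference, a couple of quick ways to confirm the installed version on a
Debian node (package names can differ slightly between releases):

    gluster --version
    dpkg -l 'glusterfs*'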
Pranith Kumar Karampuri
2018-09-07 05:16:01 UTC
Post by Dave Sherohman
Post by Dave Sherohman
Is there anything I can do to kick the self-heal back into action and
get those final 59 entries cleaned up?
In response to the request about what version of gluster I'm running
(...which I deleted prematurely...), it's the latest version from the
Debian stable repository, which they identify as 3.8.8-1.
Hey, 3.8.8-1 is EOL. Is it possible to use an upstream version that is
still maintained, like 3.12.x or 4.1.x?
Post by Dave Sherohman
--
Dave Sherohman
--
Pranith
Dave Sherohman
2018-09-07 14:00:35 UTC
Post by Pranith Kumar Karampuri
Post by Dave Sherohman
Post by Dave Sherohman
Is there anything I can do to kick the self-heal back into action and
get those final 59 entries cleaned up?
In response to the request about what version of gluster I'm running
(...which I deleted prematurely...), it's the latest version from the
Debian stable repository, which they identify as 3.8.8-1.
Hey, 3.8.8-1 is EOL. Is it possible to use an upstream version that is
still maintained, like 3.12.x or 4.1.x?
I prefer to stick with the Debian stable releases because they are
**STABLE**. Backported fixes for security issues, and that's it. No
new features to introduce new bugs, no incremental changes that just
happen to break backwards compatibility in the process.

We currently are using an upstream elasticsearch because of an
application which requires features that aren't in deb-stable. As part
of the same server move that led to my question here, we also had our
elasticsearch cluster go down because, when the servers rebooted, a
version incompatibility with one of the es plugins prevented it from
starting back up. I don't want that happening with our disks. I want
something that I know works today and will continue to work tomorrow,
even if a security patch comes out between now and then.


If gluster upstream has a "security fixes and critical bugfixes ONLY,
never a single new feature" version available, then point me at it and
I'd be comfortable switching to that, but if it's the usual "Security
fix? Just upgrade to the latest and greatest new version!", then I'd
really rather not. That model works (more or less...) for end-user
software, but I don't want it anywhere near my servers.
--
Dave Sherohman
Pranith Kumar Karampuri
2018-09-10 07:28:04 UTC
Post by Dave Sherohman
Post by Pranith Kumar Karampuri
Post by Dave Sherohman
Post by Dave Sherohman
Is there anything I can do to kick the self-heal back into action and
get those final 59 entries cleaned up?
In response to the request about what version of gluster I'm running
(...which I deleted prematurely...), it's the latest version from the
Debian stable repository, which they identify as 3.8.8-1.
Hey, 3.8.8-1 is EOL. Is it possible to use an upstream version that is
still maintained, like 3.12.x or 4.1.x?
I prefer to stick with the Debian stable releases because they are
**STABLE**. Backported fixes for security issues, and that's it. No
new features to introduce new bugs, no incremental changes that just
happen to break backwards compatibility in the process.
+de Vos, Niels <***@redhat.com> +Keithley, Kaleb <***@redhat.com>
+Shyam <***@redhat.com>
Does the Debian community do any feature/stability testing for glusterfs to
make sure that their releases are more stable than the releases the Gluster
community makes? Do you guys know? As far as I understand, the deb packages
included in a Debian release are built from the stable branches of glusterfs?
Post by Dave Sherohman
We currently are using an upstream elasticsearch because of an
application which requires features that aren't in deb-stable. As part
of the same server move that led to my question here, we also had our
elasticsearch cluster go down because, when the servers rebooted, a
version incompatibility with one of the es plugins prevented it from
starting back up. I don't want that happening with our disks. I want
something that I know works today and will continue to work tomorrow,
even if a security patch comes out between now and then.
If gluster upstream has a "security fixes and critical bugfixes ONLY,
never a single new feature" version available, then point me at it and
I'd be comfortable switching to that, but if it's the usual "Security
fix? Just upgrade to the latest and greatest new version!", then I'd
really rather not. That model works (more or less...) for end-user
software, but I don't want it anywhere near my servers.
I'm afraid the answer is no. I personally fixed at least one bug that
prevents a stuck-heal/deadlock issue; the fix went into the 3.10 release
(even though the description of the bug says arbiter, we found it to be a
generic bug: https://bugzilla.redhat.com/show_bug.cgi?id=1401404), and that
release also has new features. Let us wait for answers from Shyam/Niels/Kaleb
to find out whether the Debian community does indeed do something to make
their releases more stable than the releases that happen in the Gluster
community (as far as I understand, it doesn't). Maybe that will convince you
to reconsider your stance about upgrading to one of the active stable
releases of Gluster; then we can see if you still face the problem, and we
can help fix it in further releases.
Post by Dave Sherohman
--
Dave Sherohman
--
Pranith
Dave Sherohman
2018-09-10 11:07:44 UTC
Post by Pranith Kumar Karampuri
I'm afraid the answer is no. I personally fixed at least one bug that
prevents a stuck-heal/deadlock issue; the fix went into the 3.10 release
(even though the description of the bug says arbiter, we found it to be a
generic bug: https://bugzilla.redhat.com/show_bug.cgi?id=1401404), and that
release also has new features.
Do you know of any additional diagnostics I could run to check whether
my stuck heal problem is the one described in that bug report? The most
obvious symptoms match, but, of course, that's not conclusive proof.
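A rough sketch of the diagnostics that tend to be useful for a wedged heal,
assuming a volume named "myvol" and a brick at /srv/glusterfs/brick1 (both
placeholders); the output is the kind of thing worth posting to the list:

    # list the entries (paths or gfids) that refuse to heal
    gluster volume heal myvol info

    # on each brick, inspect the AFR pending xattrs of one stuck file
    getfattr -d -m . -e hex /srv/glusterfs/brick1/path/to/stuck/file

    # dump brick state, including any locks still being held
    gluster volume statedump myvol

    # the self-heal daemon log often explains why an entry is skipped
    tail -f /var/log/glusterfs/glustershd.log

Stale locks showing up in the statedump would at least be consistent with the
deadlock described in that bug report.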
Post by Pranith Kumar Karampuri
Let us wait for answers from Shyam/Niels/Kaleb to find out whether the
Debian community does indeed do something to make their releases more
stable than the releases that happen in the Gluster community (as far as
I understand, it doesn't).
That's my understanding as well. In this case, "stable" is meant in the
"no changes to the software" sense rather than the "do everything
possible to make the software more reliable" sense. Unfortunately,
those two goals are generally in conflict, since making the software
more reliable tends to mean making changes to it.
Post by Pranith Kumar Karampuri
Maybe that will convince you to reconsider your stance about upgrading
to one of the active stable releases of Gluster; then we can see if you
still face the problem, and we can help fix it in further releases.
Sounds good, and thanks for referring the question to those who would
know!
--
Dave Sherohman
Kaleb S. KEITHLEY
2018-09-10 11:31:44 UTC
...
Does the Debian community do any feature/stability testing for glusterfs to
make sure that their releases are more stable than the releases the Gluster
community makes? Do you guys know? As far as I understand, the deb packages
included in a Debian release are built from the stable branches of glusterfs?
Copying Patrick for an authoritative answer.

AFAIK they do _no_ testing other than 'does it build.'

Unless I'm very much mistaken, once they pick a version for a
distribution (e.g. 3.8 for jessie) then that's what they ship for the
life of that distribution.

Which is why the Gluster Community provides 'convenience' packages on
download.gluster.org, LaunchPad PPA, OBS, and the CentOS Storage SIG.

HTH
--
Kaleb
Niels de Vos
2018-09-20 08:34:49 UTC
On Thu, Sep 20, 2018 at 10:19:27AM +0200, Patrick Matthäi wrote:
...
Post by Kaleb S. KEITHLEY
Unless I'm very much mistaken, once they pick a version for a
distribution (e.g. 3.8 for jessie) then that's what they ship for the
life of that distribution.
Correct.
But you can also use our stable backports [0]. Currently it contains
version 4.0.2-1~bpo9+1.
[0]: https://backports.debian.org/
This is cool! I was not aware it existed. Could you (or someone else)
send a PR on GitHub to have the Debian stable backports added to
https://docs.gluster.org/en/latest/Install-Guide/Community_Packages/ ?
("Edit on GitHub" in the upper right corner of the page.)

Thanks,
Niels
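For the record, pulling glusterfs from stretch-backports looks roughly like
this (the repository line and package names follow the generic backports
instructions; double-check them against backports.debian.org):

    # /etc/apt/sources.list.d/backports.list
    deb http://deb.debian.org/debian stretch-backports main

    # backports are never pulled in by default; the target release has to
    # be requested explicitly
    apt-get update
    apt-get -t stretch-backports install glusterfs-server glusterfs-client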
Kaleb S. KEITHLEY
2018-09-20 11:19:34 UTC
Post by Niels de Vos
...
Post by Kaleb S. KEITHLEY
Unless I'm very much mistaken, once they pick a version for a
distribution (e.g. 3.8 for jessie) then that's what they ship for the
life of that distribution.
Correct.
But you can also use our stable backports [0]. Currently it contains
version 4.0.2-1~bpo9+1.
[0]: https://backports.debian.org/
This is cool! I was not aware it existed. Could you (or someone else)
send a PR on GitHub to have the Debian stable backports added to
https://docs.gluster.org/en/latest/Install-Guide/Community_Packages/ ?
("Edit on GitHub" in the upper right corner of the page.)
I wasn't aware of it either.

I agree that Debian needs to better advertise the availability of
backports.debian.org, but the Community_Packages page isn't the right
place IMO.

Perhaps we should have a separate Distribution_Packages page where
things like backports.debian.org can be documented.
--
Kaleb
Dave Sherohman
2018-09-20 08:35:58 UTC
I was just about to come over and say that, after talking this through
with coworkers, we've decided to upgrade to something outside of Debian
stable. And what should I find?
But you can also use our stable backports [0]. Currently it contains
version 4.0.2-1~bpo9+1.
Generally speaking, what do you think would be more stable, the Debian
stable-backports version or the upstream LTS version (3.12, IIRC)?


And, whichever version we go with, what would be the process for
upgrading from 3.8.8? Can it (safely) be done live? About how long
should we expect it to take to upgrade a 23T (4.5T used) replica 2+A
volume with three subvolumes?
--
Dave Sherohman
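Not an authoritative answer, but the upstream upgrade guides describe a
rolling, one-server-at-a-time procedure roughly along these lines (a sketch
only; whether a live upgrade from 3.8 to a given target is supported is
exactly what the guide for that target spells out, so read it first):

    # on one server at a time
    systemctl stop glusterd          # service name may be glusterd or
                                     # glusterfs-server depending on packaging
    killall glusterfs glusterfsd     # stop any client/brick/shd processes left behind
    apt-get install <new glusterfs packages>   # placeholder for your chosen source
    systemctl start glusterd

    # wait for heals to drain before touching the next server
    gluster volume heal myvol info   # "myvol" is again a placeholder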
Dave Sherohman
2018-10-22 11:22:31 UTC
A month and a half later, I've finally managed to make the necessary
arrangements and upgraded last week from gluster 3.8.8 to 3.12.15.

Doing the upgrade cleared up the substantial majority of the entries
which refused to heal, but there are still 5 outstanding entries after
allowing a few days for them to complete. (I finished the upgrades on the
affected subvolume on Wednesday of last week and on the other two
subvolumes on Friday, and didn't touch anything over the weekend, of
course.)

So, going back to my original question, what can I do to get these
remaining entries to heal and have a fully-consistent cluster again?
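If the five stragglers show up as bare gfids in "heal info", a hedged sketch
of how they are usually chased down on a brick (the brick path and gfid below
are placeholders):

    gluster volume heal myvol info

    # a <gfid:...> entry lives under the brick's .glusterfs directory,
    # keyed by the first two pairs of hex digits of the gfid
    ls -l /srv/glusterfs/brick1/.glusterfs/d3/ad/d3adbeef-0000-0000-0000-000000000000

    # for regular files that entry is a hardlink, so the real path can be
    # recovered by matching inodes
    find /srv/glusterfs/brick1 -samefile \
        /srv/glusterfs/brick1/.glusterfs/d3/ad/d3adbeef-0000-0000-0000-000000000000

    # then compare the AFR xattrs of that file on each brick of the replica set
    getfattr -d -m . -e hex /srv/glusterfs/brick1/path/to/that/file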
Post by Dave Sherohman
Last Friday, I rebooted one of my gluster nodes and it didn't properly
mount the filesystem holding its brick (I had forgotten to add it to
fstab...), so, when I got back to work on Monday, its root filesystem
was full and the gluster heal info showed around 25000 entries needing
to be healed.
I got the filesystems straightened out and, within a matter of minutes,
the number of entries waiting to be healed in that subvolume dropped to
59. (Showing twice, of course. The cluster is replica 2+A, so the
other full replica and the arbiter are both showing the same list of
entries.) Over a full day later, it's still at 59.
Is there anything I can do to kick the self-heal back into action and
get those final 59 entries cleaned up?
--
Dave Sherohman