[Gluster-users] Subject: Help needed in improving monitoring in Gluster

Discussion:

Pranith Kumar Karampuri

2018-07-23 14:03:36 UTC

Hi,

We want gluster's monitoring/observability to be as easy as possible
going forward. As part of reaching this goal we are starting this
initiative to improve existing APIs/commands and add new APIs/commands
to gluster so that the admin can integrate it with whichever monitoring
tool he/she likes. The gluster-prometheus project hosted at
https://github.com/gluster/gluster-prometheus is the direction in which we
feel metrics can be collected and distributed from a Gluster cluster,
enabling analysis and visualization.

As a first step we want to hear from you what you feel needs to be
addressed.

Here are some questions we came up with:

1) How do you monitor if the volumes/gluster management plane are behaving
as expected?

2) How do you monitor performance of the volumes/gluster management plane?

3) What are the problems at the moment that take very long before you find
that gluster is not behaving as expected?

4) Are there any gaps that need to be addressed which will add missing
information in the existing commands?

5) What are the aspects of gluster that you wish to monitor but are not
easily able to?

6) What existing monitoring commands in gluster do you wish to use at
regular intervals but you don't because they are too slow/error-prone?

We will convert the responses we receive up to the 30th of this month into
GitHub issues and come up with a roadmap for the first release of this
project.

Appreciate your insights and feedback.

Thanks in advance,

On behalf of the team (GitHub handles):

Pranith (@pranithk), Venkata (@vredara), Sridhar (@sseshasa).

Maarten van Baarsel

2018-07-23 14:54:14 UTC


Post by Pranith Kumar Karampuri
We want gluster's monitoring/observability to be as easy as possible
going forward. As part of reaching this goal we are starting this
initiative to improve existing APIs/commands and add new APIs/commands
to gluster so that the admin can integrate it with whichever monitoring
tool he/she likes. The gluster-prometheus project hosted at
https://github.com/gluster/gluster-prometheus is the direction in which we
feel metrics can be collected and distributed from a Gluster cluster,
enabling analysis and visualization.
As a first step we want to hear from you what you feel needs to be
addressed.

Regarding monitoring: I would love to see in my monitoring that
geo-replication is working as intended. At the moment I'm faking georep
monitoring by having a process touch a file on every volume (each server
involved in gluster touches a different file) and checking the mtime on
the slave.

However, I discovered that this is not foolproof: if the georep run
stops for whatever reason, the mtime of the monitored file is still kept
updated, probably because it's updated too often, but the georep is not
complete.
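
A minimal sketch of the heartbeat check described above, for illustration only. The mount paths, the `.heartbeat-<hostname>` file name and the 600-second threshold are assumptions, not anything gluster provides. As noted above, a fresh mtime on the slave does not prove that geo-replication as a whole is complete, so this only detects a heartbeat that has stopped arriving.

```python
import os
import time


def write_heartbeat(master_mount, hostname):
    """Write the current Unix time into a per-host heartbeat file on the master volume."""
    # The file name is hypothetical; any per-host name works.
    path = os.path.join(master_mount, f".heartbeat-{hostname}")
    with open(path, "w") as fh:
        fh.write(str(int(time.time())))
    return path


def heartbeat_age(slave_path, now=None):
    """Return the seconds elapsed since the slave copy was last modified."""
    now = time.time() if now is None else now
    return now - os.stat(slave_path).st_mtime


def is_stale(slave_path, max_age_seconds=600, now=None):
    """True when the slave copy is older than the threshold or does not exist."""
    try:
        return heartbeat_age(slave_path, now) > max_age_seconds
    except FileNotFoundError:
        return True
```

Each server would call `write_heartbeat` on its master mount from a timer, and the monitoring system would call `is_stale` against the matching file on the slave mount.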

I've also seen that a crashed glusterd escapes this monitoring.

What would also be fun is some kind of monitoring where you can find out
why gluster is running at X MB/sec where Y MB/sec is expected (a rather
large target, that).

I once tried monitoring 'gluster volume status all' output, but that
only works if everything is OK; with some network problems you can wait
for hours for output, which then causes more problems.
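
One way a monitoring script can protect itself from such a hang is to run the command with a hard deadline. The sketch below uses Python's `subprocess.run` with its `timeout` argument; the helper name `run_with_timeout`, the status strings and the 10-second default are assumptions chosen for illustration.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds=10):
    """Run cmd and return (status, stdout).

    status is one of 'ok', 'failed', 'timeout' or 'missing'.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    except FileNotFoundError:
        # The executable is not installed on this host.
        return "missing", ""
    status = "ok" if result.returncode == 0 else "failed"
    return status, result.stdout


if __name__ == "__main__":
    status, output = run_with_timeout(["gluster", "volume", "status", "all"])
    print(status)
```

Reporting `timeout` as its own state lets the monitoring system alert on a slow management plane instead of piling up stuck checks.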

Also, I've checked the example output at
https://github.com/gluster/gluster-prometheus:

Would JSON or something like that be more friendly to parse than
the "[parameter] { [details] } [number]" format?
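
For readers who want JSON today, the "[parameter] { [details] } [number]" lines can be converted with a small parser. The sketch below is an illustration, not part of gluster-prometheus: it handles only simple `name{label="value"} number` lines, skips `#` comment lines, and leaves escape sequences inside label values as they are. The metric names in the usage note are hypothetical.

```python
import json
import re

# name, optional {labels}, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Turn metric lines into a list of {'name', 'labels', 'value'} dicts."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append({
            "name": match.group("name"),
            "labels": labels,
            "value": float(match.group("value")),
        })
    return samples


def metrics_to_json(text):
    """Serialize the parsed samples as a JSON array."""
    return json.dumps(parse_metrics(text))
```

For example, `metrics_to_json('example_brick_bytes_used{volume="vol1"} 1024')` yields a one-element JSON array whose object carries the name, a `labels` map and the value as a number.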

thanks,
Maarten.

Sankarshan Mukhopadhyay

2018-07-23 16:25:50 UTC


On Mon, Jul 23, 2018 at 8:24 PM, Maarten van Baarsel

Post by Maarten van Baarsel

Regarding monitoring: I would love to see in my monitoring that
geo-replication is working as intended. At the moment I'm faking georep
monitoring by having a process touch a file on every volume (each server
involved in gluster touches a different file) and checking the mtime on
the slave.

I'd like to request that if possible, you elaborate on how you'd like
to see the "as intended" situation. What kind of data points and/or
visualization would aid you in arriving at that conclusion?

Maarten van Baarsel

2018-07-25 14:28:10 UTC


Post by Sankarshan Mukhopadhyay

Post by Maarten van Baarsel
Regarding monitoring: I would love to see in my monitoring that
geo-replication is working as intended. At the moment I'm faking georep
monitoring by having a process touch a file on every volume (each server
involved in gluster touches a different file) and checking the mtime on
the slave.

I'd like to request that if possible, you elaborate on how you'd like
to see the "as intended" situation. What kind of data points and/or
visualization would aid you in arriving at that conclusion?

For this particular example: perhaps last sync time, and number of files
on both sides. I realize this is a difficult problem...

Number of files touched by sync per run?

Currently there is a 'started/stopped/faulty' indicator for the geo-rep,
which could be exposed as well.

A monitoring interface that is guaranteed to be non-blocking would be a
great enhancement.
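
One common way to get non-blocking reads on the consumer side is to let a background thread do the slow collection and have readers return the last cached value together with its age. The sketch below illustrates that pattern only; it is not a gluster interface. The class name `CachedCollector` and the `collect` callable are hypothetical.

```python
import threading
import time


class CachedCollector:
    """Refresh a value in a background thread so readers never wait on the collector."""

    def __init__(self, collect, interval_seconds=30.0):
        self._collect = collect          # slow callable, e.g. one that shells out
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._value = None
        self._updated_at = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()

    def _run(self):
        while not self._stop.is_set():
            try:
                value = self._collect()  # may be slow; runs outside the lock
            except Exception:
                pass                     # keep the old value; readers see it age
            else:
                with self._lock:
                    self._value = value
                    self._updated_at = time.monotonic()
            self._stop.wait(self._interval)

    def read(self):
        """Return (value, age_seconds) immediately; (None, None) before the first success."""
        with self._lock:
            if self._updated_at is None:
                return None, None
            return self._value, time.monotonic() - self._updated_at
```

The lock is held only while the cached value is swapped or read, so `read()` returns promptly even when the collector itself is stuck, and a growing age tells the monitoring system that the data is stale.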

M.