Discussion:
[Gluster-users] How to check running transactions in gluster?
Jeevan Patnaik
2018-11-25 15:09:20 UTC
Permalink
Hi,

I am getting the output "Another transaction is in progress" with a few gluster
volume commands, including the stop command. The gluster volume status
command just hangs and then fails with a timeout error.

So, I want to find out which transaction is hung, and how can I check this? I
ran the volume statedump command, but didn't wait until it completed to check
whether it was hung or producing any result, as it is also taking time.
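(For reference, a hedged sketch of that step: the statedump is requested through the gluster CLI, and "myvol" below is a placeholder volume name.)

# Request a statedump of the volume's brick processes (placeholder volume name)
gluster volume statedump myvol
# The dump files are written on the brick nodes, by default under /var/run/gluster
ls -lt /var/run/gluster/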

Please help me with this. I'm struggling with these gluster timeout errors
:(

Besides, I have also tuned the gluster transport.listen-backlog option to 200,
and the following kernel parameters to avoid SYN overflow rejects:
net.core.somaxconn = 1024
net.ipv4.tcp_max_syn_backlog = 20480
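(A minimal sketch of how those settings might be applied; the glusterd.vol option syntax and file path are assumptions and may vary between glusterfs versions.)

# Kernel parameters, applied at runtime (persist them via /etc/sysctl.conf or a drop-in file)
sysctl -w net.core.somaxconn=1024
sysctl -w net.ipv4.tcp_max_syn_backlog=20480
# Assumed way to raise glusterd's listen backlog: add this option to
# /etc/glusterfs/glusterd.vol on each node and restart glusterd
#   option transport.listen-backlog 200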

Regards,
Jeevan.
Atin Mukherjee
2018-11-26 02:51:52 UTC
Permalink
Post by Jeevan Patnaik
Hi,
I am getting the output "Another transaction is in progress" with a few gluster
volume commands, including the stop command. The gluster volume status
command just hangs and then fails with a timeout error.
This is primarily because glusterd was not allowed to complete its
handshake with the other nodes when the glusterd services were restarted
concurrently (as I understand from your previous email to the list). With
GlusterD (read as GD1) this is a known design challenge: because of its
N x N handshaking mechanism during the restart sequence, which brings all
the configuration data into a consistent state, we have seen that the
overall recovery time of the cluster can take very long if N is on the
higher side (in your case N = 72, which is certainly high). Hence the
recommendation is not to restart the glusterd services concurrently, and to
wait for the handshaking to complete.
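(A minimal sketch of such a rolling restart; the host names are placeholders, the expected peer count of 71 assumes N = 72, and the "Peer in Cluster (Connected)" match is based on typical gluster peer status output.)

# Restart glusterd one node at a time, waiting for the peer handshake to settle
for host in node01 node02 node03; do                      # placeholder host list
    ssh "$host" systemctl restart glusterd
    # wait until this node sees all 71 other peers as connected again
    until [ "$(ssh "$host" gluster peer status | grep -c 'Peer in Cluster (Connected)')" -eq 71 ]; do
        sleep 10
    done
done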
Post by Jeevan Patnaik
So, I want to find out which transaction is hung, and how can I check this? I
ran the volume statedump command, but didn't wait until it completed to check
whether it was hung or producing any result, as it is also taking time.
kill -SIGUSR1 $(pidof glusterd) should get you a glusterd statedump file in
/var/run/gluster, which includes a backtrace dump at the bottom that shows
which transaction is currently in progress. In case a transaction stays
queued for more than 180 seconds (which is not usual), the unlock timer
forcibly releases such locks.
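(A quick sketch of that; the glusterdump file-name pattern is an assumption and may differ between versions.)

# Ask the running glusterd process for a statedump (no restart needed)
kill -SIGUSR1 $(pidof glusterd)
# The dump is written under /var/run/gluster; the newest glusterdump.* file
# holds the lock/transaction entries and a backtrace near the end
ls -lt /var/run/gluster/ | head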
Post by Jeevan Patnaik
Please help me with this. I'm struggling with these gluster timeout
errors :(
Besides, I have also tuned the gluster transport.listen-backlog option to 200,
and the following kernel parameters to avoid SYN overflow rejects:
net.core.somaxconn = 1024
net.ipv4.tcp_max_syn_backlog = 20480
Regards,
Jeevan.
Atin Mukherjee
2018-11-26 02:52:42 UTC
Permalink
Post by Atin Mukherjee
This is primarily because glusterd was not allowed to complete its
handshake with the other nodes when the glusterd services were restarted
concurrently (as I understand from your previous email to the list). With
GlusterD (read as GD1) this is a known design challenge: because of its
N x N handshaking mechanism during the restart sequence, which brings all
the configuration data into a consistent state, we have seen that the
overall recovery time of the cluster can take very long if N is on the
higher side (in your case N = 72, which is certainly high). Hence the
recommendation is not to restart the glusterd services concurrently, and to
wait for the handshaking to complete.
Forgot to mention that GlusterD2 (https://github.com/gluster/glusterd2),
which is in the development phase, addresses this design problem.
Jeevan Patnaik
2018-11-26 04:04:19 UTC
Permalink
Hi Atin,

Thanks for the details. I think the issue is with a few of the nodes, which
aren't serving any bricks and are in the rejected state. When I remove them
from the pool and stop glusterfs on those nodes, everything seems normal.
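(For reference, a minimal sketch of that removal, with a placeholder host name.)

# Detach a spare node from the trusted storage pool and stop its gluster service
gluster peer detach spare-node01          # placeholder host name
ssh spare-node01 systemctl stop glusterd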

We keep those nodes as spares, but have glusterd running, because in our
configuration the servers are also clients and we are using Gluster NFS
without failover for the mounts. To localize the impact if a node goes down,
we use localhost as the NFS server on each node, i.e.:
mount -t nfs localhost:/volume /mountpoint
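(For reference, Gluster's built-in NFS server only speaks NFSv3, so a mount pinned to that version might look like the sketch below; the option set is an assumption.)

# Mount the local gNFS export, pinning the protocol to NFSv3 (assumed options)
mount -t nfs -o vers=3,proto=tcp localhost:/volume /mountpoint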

So, glusterfs should be running on these spare nodes. Now, is it okay to
keep those nodes in the pool? Will they go into the rejected state again and
cause transaction locks? And why aren't they in sync even though they're part
of the pool?

Regards,
Jeevan.
Sanju Rakonde
2018-11-26 12:44:19 UTC
Permalink
Hi Jeevan,

You might be hitting https://bugzilla.redhat.com/show_bug.cgi?id=1635820

Were any of the volumes in the "Created" state when the peer reject issue
was seen?
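(A quick way to check that is sketched below.)

# List each volume with its current state; "Created" means it was never started
gluster volume info | grep -E '^(Volume Name|Status):'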

Thanks,
Sanju
Post by Jeevan Patnaik
Hi Atin,
Thanks for the details. I think the issue is with a few of the nodes, which
aren't serving any bricks and are in the rejected state. When I remove them
from the pool and stop glusterfs on those nodes, everything seems normal.
We keep those nodes as spares, but have glusterd running, because in our
configuration the servers are also clients and we are using Gluster NFS
without failover for the mounts. To localize the impact if a node goes down,
we use localhost as the NFS server on each node, i.e.:
mount -t nfs localhost:/volume /mountpoint
So, glusterfs should be running on these spare nodes. Now, is it okay to
keep those nodes in the pool? Will they go into the rejected state again and
cause transaction locks? And why aren't they in sync even though they're part
of the pool?
Regards,
Jeevan.
--
Thanks,
Sanju
Jeevan Patnaik
2018-11-27 06:43:30 UTC
Permalink
Hi,

Thanks. I might have tried to stop the volume before restarting glusterd.
Also, it's fine now; I think the volume is already started.

Regards,
Jeevan.
Post by Sanju Rakonde
Hi Jeevan,
You might be hitting https://bugzilla.redhat.com/show_bug.cgi?id=1635820
Were any of the volumes in the "Created" state when the peer reject issue
was seen?
Thanks,
Sanju