[Gluster-users] What are the different network communications that happen in gluster?

Jeevan Patnaik

2018-12-02 15:44:01 UTC

Hello all,

I am trying to make a note of all gluster operations that happen over the
network, in order to be able to tune our gluster nodes as required.

First of all, our setup is far from ideal, but we want to tune it as best
as possible.
-> Our nodes are also LSF execution nodes, so gluster shares CPU, memory and
network with the jobs. (We are planning to use cgroups to keep enough of
each resource available for gluster.)
-> However, in our LSF setup we allow jobs to use more memory than they
requested, so we can expect aggressive swapping when demand is too high.
-> On top of that, our swap space, the gluster bricks and the entire OS
filesystem come from the same RAID disk. So whenever there's swapping,
utilization of our only disk goes over the top, which in turn affects
gluster IO.
-> Slower performance is okay, as we will take the necessary steps in time,
i.e. kill jobs that are using more memory than they requested.
-> But we don't want network timeout or connection reset errors, which could
mess up operations across the entire cluster and would need a fair amount of
work to resolve.

-> I'm not sure whether the above scenario can cause these timeout errors.
However, there are other cases that can cause them, and these have also been
observed.
-> We increased transport.listen-backlog in gluster to a higher value (200)
and tuned the kernel: somaxconn=1024, tcp_max_syn_backlog=20480.
-> These are just arbitrary high values, and I'm not sure they are enough.
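
One way to judge whether backlog values like these are enough is to watch the
kernel's listen-queue counters over time. The following is only a minimal
Python sketch, assuming a Linux host where /proc/net/netstat contains the
usual "TcpExt:" header line followed by a "TcpExt:" value line; the function
names are hypothetical and not part of gluster.

```python
def parse_listen_counters(netstat_text):
    """Extract ListenOverflows and ListenDrops from /proc/net/netstat text."""
    wanted = ("ListenOverflows", "ListenDrops")
    lines = netstat_text.splitlines()
    # Each section is a header line followed by a value line, and both
    # lines start with the same prefix (here "TcpExt:").
    for header, values in zip(lines, lines[1:]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            names = header.split()[1:]
            numbers = [int(v) for v in values.split()[1:]]
            fields = dict(zip(names, numbers))
            return {name: fields.get(name, 0) for name in wanted}
    # Section not found: report zeros rather than failing.
    return {name: 0 for name in wanted}


def read_listen_counters(path="/proc/net/netstat"):
    """Read the counters from the live system (Linux only, assumed path)."""
    with open(path) as f:
        return parse_listen_counters(f.read())
```

Calling read_listen_counters() periodically and comparing successive results
would show whether the counters keep growing, which would suggest that accept
queues are overflowing under load.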

-> So we can fairly expect timeout errors, as our tuning is not perfect.
Hence, to be able to analyze these issues, I want to estimate the possible
number of pending connections and network communications, and for that I
need to know all the gluster operations that use the network and how
frequently they occur.
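
As a rough starting point for counting connections, the sockets on a gluster
port can be grouped by TCP state. This is only a sketch in Python, assuming
the Linux /proc/net/tcp text format; count_states is a hypothetical helper,
and port 24007 in the usage note is an assumption (the usual default glusterd
management port), while brick ports differ per setup.

```python
from collections import Counter

# TCP state codes as they appear (in hex) in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, local_ports):
    """Count sockets per TCP state whose local port is in local_ports."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:5DC7" (hex address : hex port).
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port in local_ports:
            state = TCP_STATES.get(fields[3].upper(), fields[3])
            counts[state] += 1
    return counts
```

For example, count_states(open("/proc/net/tcp").read(), {24007}) would
summarize the IPv4 sockets on the assumed glusterd port; /proc/net/tcp6 would
have to be read separately for IPv6 sockets.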

Examples:
Self-heal daemon operations
Peer communications for gluster peer status (does this happen?)
And so on.

Thanks in advance.

Regards,
Jeevan.