Niels Hendriks
2018-06-08 11:52:18 UTC
Hello,
We have a 3-way replicated Gluster setup where the clients connect over NFS
and each client is also one of the servers. The Gluster NFS server keeps
increasing its RAM usage until the server eventually runs out of memory. We
see this on all 3 servers. Each server has 96GB of RAM in total and we've
seen the Gluster NFS server use up to 70GB of RAM with all of the swap 100%
in use. If other processes weren't also using the RAM, I suspect Gluster
would claim that as well.
We are running GlusterFS 3.12.9-1 on Debian 8.
The process causing the high memory usage is:
/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p
/var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S
/var/run/gluster/94e073c0dae2c47025351342ba0ddc44.socket
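The growth is easy to see by sampling the RSS of that process over time,
e.g. with something like this (just a sketch; the log path and interval are
arbitrary, the pid file is the one from the command line above):

# Sample the gNFS process RSS (in KB) every 10 minutes
while true; do
  echo "$(date -Is) $(ps -o rss= -p "$(cat /var/run/gluster/nfs/nfs.pid)")" >> /var/log/gnfs-rss.log
  sleep 600
done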
Gluster volume info:
Volume Name: www
Type: Replicate
Volume ID: fbcc21ee-bd0b-40a5-8785-bd00e49e9b72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.0.0.3:/storage/sdc1/www
Brick2: 10.0.0.2:/storage/sdc1/www
Brick3: 10.0.0.1:/storage/sdc1/www
Options Reconfigured:
diagnostics.client-log-level: ERROR
performance.stat-prefetch: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation: on
network.ping-timeout: 3
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off
performance.cache-size: 1GB
performance.write-behind-window-size: 4MB
performance.nfs.io-threads: on
performance.nfs.io-cache: off
performance.nfs.quick-read: off
performance.nfs.write-behind-window-size: 4MB
features.cache-invalidation-timeout: 600
performance.nfs.stat-prefetch: on
network.inode-lru-limit: 90000
performance.cache-priority: *.php:3,*.temp:3,*:1
cluster.readdir-optimize: on
performance.nfs.read-ahead: off
performance.flush-behind: on
performance.write-behind: on
performance.nfs.write-behind: on
performance.nfs.flush-behind: on
features.bitrot: on
features.scrub: Active
performance.quick-read: off
performance.io-thread-count: 64
nfs.enable-ino32: on
nfs.log-level: ERROR
storage.build-pgfid: off
diagnostics.brick-log-level: WARNING
cluster.self-heal-daemon: enable
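(All of the options above were applied with "gluster volume set"; for
example, the cache-related ones, with the values we currently run, look like
this:)

gluster volume set www performance.cache-size 1GB
gluster volume set www network.inode-lru-limit 90000
gluster volume set www performance.md-cache-timeout 600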
We don't see anything in the logs that looks like it could explain the high
memory usage. We did make a statedump, which I'll post here and have also
attached:
https://pastebin.com/raw/sDNF1wwi
Running the command to get the statedump is quite risky for us, as the USR1
signal appeared to cause Gluster to move swapped-out memory back into RAM,
and Gluster goes offline while this is in progress.
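(For reference, the dump was triggered by sending USR1 to the gNFS process;
the CLI form should be equivalent, and the dump files land under
/var/run/gluster by default:)

kill -USR1 "$(cat /var/run/gluster/nfs/nfs.pid)"
# or, via the CLI:
gluster volume statedump www nfs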
FWIW, we do have vm.swappiness set to 1.
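(That is set via sysctl, i.e. something like:)

sysctl -w vm.swappiness=1    # persisted via /etc/sysctl.conf or /etc/sysctl.d/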
Does anyone have an idea of what could cause this and what we can do to
stop such high memory usage?
Cheers,
Niels