Discussion: [Gluster-users] SQLite3 on 3 node cluster FS?
Paul Anderson
2018-03-05 14:51:48 UTC
Hi,

tl;dr summary of below: flock() works, but what does it take to make
sync()/fsync() work in a 3 node GFS cluster?

I am under the impression that POSIX flock, POSIX
fcntl(F_SETLK/F_GETLK,...), and POSIX read/write/sync/fsync are all
supported in cluster operations, such that in theory, SQLite3 should
be able to atomically lock the file (or a subset of its pages), modify
pages, flush the pages to gluster, then release the lock, and thus
satisfy the ACID properties that SQLite3 appears to provide on a local
filesystem.

In a test we wrote that fires off 10 simple concurrent SQL insert,
read, update loops, we discovered that we at least need to use flock()
around the SQLite3 db connection open/update/close to protect it.
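
Roughly, each worker does something like this (a stripped-down sketch of
the pattern, not our actual test script, and the paths are made up):

    <?php
    // Serialize whole open/update/close cycles behind a flock() taken on
    // a sidecar lock file next to the database.
    $lock = fopen('/mnt/dockerstore/test.db.lock', 'c');
    if ($lock === false || !flock($lock, LOCK_EX)) { // blocks until we own the lock
        exit(1);
    }
    $db = new SQLite3('/mnt/dockerstore/test.db');
    $db->exec('CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, n INTEGER)');
    $db->exec('INSERT INTO t (n) VALUES (1)');
    $db->close();           // close (and flush) before giving up the lock
    flock($lock, LOCK_UN);
    fclose($lock);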

However, that is not enough - although from testing it looks like
flock() works as advertised across gluster-mounted files, sync/fsync
don't appear to, so we end up getting corruption in the SQLite3 file
(PRAGMA integrity_check generally shows a bunch of problems after a
short test).

Is what we're trying to do achievable? We're testing using the docker
container gluster/gluster-centos as the three servers, with a PHP test
running under php-cli using filesystem mounts. If we mount the gluster
FS via sapk/plugin-gluster into the php-cli containers using docker, we
seem to have better success sometimes, but I haven't figured out why
yet.

I did see that I needed to set the server volume parameter
'performance.flush-behind off', otherwise it seems that flushes won't
block as would be needed by SQLite3.
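
For reference, that was just the usual volume-set command (the volume
name is from our test setup):

    gluster volume set dockerstore performance.flush-behind off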

Does anyone have any suggestions? Any words of wisdom would be much appreciated.

Thanks,

Paul
Raghavendra Gowdappa
2018-03-05 16:26:49 UTC
Post by Paul Anderson
I did see that I needed to set the server volume parameter
'performance.flush-behind off', otherwise it seems that flushes won't
block as would be needed by SQLite3.
If you are relying on fsync, this shouldn't matter, as fsync makes sure
data is synced to disk.
Post by Paul Anderson
Does anyone have any suggestions? Any words of wisdom would be much appreciated.
Can you experiment with turning on/off various performance xlators? Based
on earlier issues, it's likely that there is stale metadata which might be
causing the issue (not necessarily improper fsync behavior). I would
suggest turning off all performance xlators. You can refer to [1] for a
related discussion. In theory the only perf xlator relevant for fsync is
write-behind, and I am not aware of any issues where fsync is not working.
Does the glusterfs log file have any messages complaining about writes or
fsync failing? Does your application use O_DIRECT? If yes, please note that
you need to turn the option performance.strict-o-direct on for write-behind
to honour O_DIRECT.
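
For example, something along these lines should turn the main performance
xlators off (exact option names can vary a little between releases):

    gluster volume set <volname> performance.write-behind off
    gluster volume set <volname> performance.stat-prefetch off
    gluster volume set <volname> performance.quick-read off
    gluster volume set <volname> performance.io-cache off
    gluster volume set <volname> performance.read-ahead off
    gluster volume set <volname> performance.open-behind off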

Also, is it possible to identify the nature of the corruption - data or
metadata? A more detailed explanation will help us RCA the issue.

Also, is your application running on a single mount or from multiple
mounts? Can you collect an strace of your application (strace -ff -T -p
<pid> -o <file>)? If possible, can you also collect a fuse dump using the
option --dump-fuse while mounting glusterfs?
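
For example (substitute your own pid, paths and volume details; the
--dump-fuse form below assumes you start the glusterfs client binary
directly rather than through the mount helper):

    strace -ff -T -p <pid> -o /tmp/app.strace
    glusterfs --volfile-server=<server> --volfile-id=<volname> \
        --dump-fuse=/tmp/glusterfs-fuse.dump /mnt/<mountpoint>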

[1]
http://lists.gluster.org/pipermail/gluster-users/2018-February/033503.html
Paul Anderson
2018-03-05 21:22:29 UTC
Raghavendra,

Thanks very much for your reply.

I fixed our data corruption problem by disabling the volume
performance.write-behind flag as you suggested, and simultaneously
disabling caching in my client-side mount command.
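
By "disabling caching" I mean client mount options roughly like the
following (we actually mount through the docker volume plugin, so the
exact flags differ, but this is the idea):

    mount -t glusterfs -o attribute-timeout=0,entry-timeout=0,negative-timeout=0 \
        172.18.0.4:/dockerstore /mnt/dockerstore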

In very modest testing, the flock() case appears to work well - before,
it would corrupt the db within a few transactions.

Testing using SQLite3's built-in locks (fcntl range locks) is better,
but has some behavioral issues (probably just requiring a query retry
when the file is locked). I'll research this more, although that test
case is not critical to our use case.
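
If we do pursue it, the retry would look something like this (purely
illustrative; this isn't in our current test script):

    <?php
    // Let SQLite itself wait on a locked database, then retry a few more
    // times ourselves before giving up.
    $db = new SQLite3('/mnt/dockerstore/test.db');
    $db->busyTimeout(5000);        // SQLite retries internally for up to 5s
    for ($attempt = 0; $attempt < 10; $attempt++) {
        if (@$db->exec('UPDATE t SET n = n + 1 WHERE id = 1')) {
            break;                 // the write went through
        }
        usleep(100000);            // back off 100ms on SQLITE_BUSY/LOCKED
    }
    $db->close();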

There are no signs of O_DIRECT use in the sqlite3 code that I can see.

I intend to set up tests that run much longer than a few minutes, to
see if there are any longer term issues. Also, I want to experiment
with data durability by killing various gluster server nodes during
the tests.

If anyone would like our test scripts, I can either tar them up and
email them or put them in github - either is fine with me. (they rely
on current builds of docker and docker-compose)

Thanks again!!

Paul

Amar Tumballi
2018-03-06 00:19:23 UTC
Post by Paul Anderson
If anyone would like our test scripts, I can either tar them up and
email them or put them in github - either is fine with me. (they rely
on current builds of docker and docker-compose)
Sure, sharing the test cases makes it very easy for us to see what the
issue might be. I would recommend a GitHub repo for the scripts.

Regards,
Amar
Joe Julian
2018-03-06 01:09:19 UTC
Tough to do. Like in my case where you would have to install and use Plex.
Post by Amar Tumballi
Sure, sharing the test cases makes it very easy for us to see what the
issue might be. I would recommend a GitHub repo for the scripts.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Csaba Henk
2018-03-06 06:14:10 UTC
Post by Amar Tumballi
Sure, sharing the test cases makes it very easy for us to see what the
issue might be. I would recommend a GitHub repo for the scripts.
I'm also curious about the tests.

Csaba
Raghavendra Gowdappa
2018-03-06 03:39:51 UTC
+Csaba.
Post by Paul Anderson
Raghavendra,
Thanks very much for your reply.
I fixed our data corruption problem by disabling the volume
performance.write-behind flag as you suggested, and simultaneously
disabling caching in my client side mount command.
Good to know it worked. Can you give us the output of
# gluster volume info

We would like to debug the problem in write-behind. Some questions:

1. What version of Glusterfs are you using?
2. Were you able to figure out whether it's stale data or metadata that is
causing the issue?

There have been patches merged in write-behind in the recent past, and one
in the works, which address metadata consistency. We would like to
understand whether you've run into any of the already identified issues.

regards,
Raghavendra
Raghavendra Gowdappa
2018-03-06 03:40:38 UTC
Adding Csaba.
Paul Anderson
2018-03-06 16:52:46 UTC
Raghavendra,

I've committed my test case to https://github.com/powool/gluster.git -
it's grungy, and a work in progress, but I am happy to take change
suggestions, especially if it will save folks significant time.

For the rest, I'll reply inline below...

On Mon, Mar 5, 2018 at 10:39 PM, Raghavendra Gowdappa
Post by Raghavendra Gowdappa
Good to know it worked. Can you give us the output of
# gluster volume info
[***@node-1 /]# gluster volume info

Volume Name: dockerstore
Type: Replicate
Volume ID: fb08b9f4-0784-4534-9ed3-e01ff71a0144
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.18.0.4:/data/glusterfs/store/dockerstore
Brick2: 172.18.0.3:/data/glusterfs/store/dockerstore
Brick3: 172.18.0.2:/data/glusterfs/store/dockerstore
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
locks.mandatory-locking: optimal
performance.flush-behind: off
performance.write-behind: off
Post by Raghavendra Gowdappa
1. What version of Glusterfs are you using?
On the server nodes:

[***@node-1 /]# gluster --version
glusterfs 3.13.2
Repository revision: git://git.gluster.org/glusterfs.git

On the docker container sqlite test node:

***@b4055d8547d2:/# glusterfs --version
glusterfs 3.8.8 built on Jan 11 2017 14:07:11

I recognize that version skew could be an issue.
Post by Raghavendra Gowdappa
2. Were you able to figure out whether it's stale data or metadata that is
causing the issue?
I lean towards stale data based on the only real observation I have:

While debugging, I put log messages in showing when the flock() is
acquired and when it is released. There is no instance where two
different processes ever hold the flock() on the same file at the same
time. From what I have read, the locks are considered metadata, and they
appear to me to be working, so that's why I'm inclined to think stale
data is the issue.
Post by Raghavendra Gowdappa
There have been patches merged in write-behind in the recent past, and one
in the works, which address metadata consistency. We would like to
understand whether you've run into any of the already identified issues.
Agreed!

Thanks,

Paul
Raghavendra Gowdappa
2018-03-06 17:28:40 UTC
Post by Paul Anderson
Post by Raghavendra Gowdappa
1. What version of Glusterfs are you using?
glusterfs 3.13.2
Repository revision: git://git.gluster.org/glusterfs.git
glusterfs 3.8.8 built on Jan 11 2017 14:07:11
I guess this is where the client is mounted. If I am correct about where
the glusterfs client is mounted, the client is running quite an old
version. There have been a significant number of fixes between 3.8.8 and
the current master. I would suggest trying out 3.13.2 patched with [1]. If
you get a chance to try this out, please report back how the tests went.

[1] https://review.gluster.org/19673
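
If it helps, one way to apply it on top of 3.13.2 is roughly the following
(the repository URL and the patch-set number in the refs/changes path are
guesses on my part - please check the review page for the exact values):

    git clone https://review.gluster.org/glusterfs
    cd glusterfs
    git checkout v3.13.2
    git fetch https://review.gluster.org/glusterfs refs/changes/73/19673/<patchset>
    git cherry-pick FETCH_HEAD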
Raghavendra Gowdappa
2018-03-06 17:32:40 UTC
Post by Raghavendra Gowdappa
I guess this is where the client is mounted. If I am correct about where
the glusterfs client is mounted, the client is running quite an old
version. There have been a significant number of fixes between 3.8.8 and
the current master.
... significant number of fixes to write-behind...

Post by Raghavendra Gowdappa
I would suggest trying out 3.13.2 patched with [1]. If you get a chance to
try this out, please report back how the tests went.
I would suggest trying out 3.13.2 patched with [1] and running the tests
with write-behind turned on.
Paul Anderson
2018-03-08 15:00:40 UTC
I was able to get the docker containers I'm using for testing to
install the latest builds from gluster.org.

So client/server versions are both 3.13.2.

I am testing two main cases, both using sqlite3. With a PHP program
wrapping all database operations with a flock(), it now works as
expected. I ran the same test 500 times (or so) yesterday afternoon,
and it worked every time.
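
For the curious, the loop was nothing fancier than something like this
(the script name is illustrative):

    for i in $(seq 500); do php sqlite-flock-test.php || break; done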

I repeated that same test both with and without
performance.flush-behind/write-behind enabled with the same result.

So that's great!

When I ran my other test case, just allowing sqlite3 fcntl() style
locks to manage data, the test failed with either performance setting.

So it could be that sqlite3 is not correctly managing its lock and
flush operations, or it is possible gluster has a data integrity
problem in the case when fcntl() style locks are used. I have no way
of knowing which is more likely...

I think I've got what I need, so someone else is going to need to pick
up the ball if they want a sqlite3 lock to work on its own with
gluster. I will say that it is slow if a bunch of writers are trying
to update individual records at the same time, since the database is
ping-ponging all over the cluster as different clients get and hold
the lock.

I've updated my github repo with my latest changes if anyone feels
like trying it on their own: https://github.com/powool/gluster.git

My summary is: sqlite3's built-in locks don't appear to work nicely with
gluster, so you have to put a flock() around the database operations
to prevent data loss. You also can't do any caching in your volume
mount on the client side. The server-side performance settings appear
not to matter, provided you're up to date on client/server code.

I hope this helps someone!

Paul
