Discussion: [Gluster-users] SQLite3 on 3 node cluster FS?
Paul Anderson
2018-03-05 14:51:48 UTC
Hi,

tl;dr summary of below: flock() works, but what does it take to make
sync()/fsync() work in a 3 node GFS cluster?

I am under the impression that POSIX flock, POSIX
fcntl(F_SETLK/F_GETLK,...), and POSIX read/write/sync/fsync are all
supported in cluster operations, such that in theory, SQLite3 should
be able to atomically lock the file (or a subset of its pages), modify
pages, flush the pages to gluster, then release the lock, and thus
satisfy the ACID properties that SQLite3 appears to provide on a local
filesystem.

In a test we wrote that fires off 10 simple concurrent SQL insert,
read, update loops, we discovered that we at least need to use flock()
around the SQLite3 db connection open/update/close to protect it.
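
Roughly, each worker does something like this (a stripped-down sketch of
the pattern, not our actual test script, and the paths are made up):

    <?php
    // Serialize whole open/update/close cycles behind a flock() taken on
    // a sidecar lock file next to the database.
    $lock = fopen('/mnt/dockerstore/test.db.lock', 'c');
    if ($lock === false || !flock($lock, LOCK_EX)) { // blocks until we own the lock
        exit(1);
    }
    $db = new SQLite3('/mnt/dockerstore/test.db');
    $db->exec('CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, n INTEGER)');
    $db->exec('INSERT INTO t (n) VALUES (1)');
    $db->close();           // close (and flush) before giving up the lock
    flock($lock, LOCK_UN);
    fclose($lock);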

However, that is not enough - although from testing it looks like
flock() works as advertised across gluster-mounted files, sync/fsync
don't appear to, so we end up getting corruption in the SQLite3 file
(PRAGMA integrity_check generally shows a bunch of problems after a
short test).

Is what we're trying to do achievable? We're testing using the docker
container gluster/gluster-centos as the three servers, with a PHP test
running under php-cli using filesystem mounts. If we mount the gluster
FS via sapk/plugin-gluster into the php-cli containers using docker, we
seem to have better success sometimes, but I haven't figured out why
yet.

I did see that I needed to set the server volume parameter
'performance.flush-behind off', otherwise it seems that flushes won't
block as would be needed by SQLite3.
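
For reference, that was just the usual volume-set command (the volume
name is from our test setup):

    gluster volume set dockerstore performance.flush-behind off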

Does anyone have any suggestions? Any words of wisdom would be much appreciated.

Thanks,

Paul
Raghavendra Gowdappa
2018-03-05 16:26:49 UTC
Post by Paul Anderson
I did see that I needed to set the server volume parameter
'performance.flush-behind off', otherwise it seems that flushes won't
block as would be needed by SQLite3.
If you are relying on fsync, this shouldn't matter, as fsync makes sure
data is synced to disk.
Post by Paul Anderson
Does anyone have any suggestions? Any words of wisdom would be much appreciated.
Can you experiment with turning on/off various performance xlators? Based
on earlier issues, it's likely that there is stale metadata which might be
causing the issue (not necessarily improper fsync behavior). I would
suggest turning off all performance xlators. You can refer to [1] for a
related discussion. In theory the only perf xlator relevant for fsync is
write-behind, and I am not aware of any issues where fsync is not working.
Does the glusterfs log file have any messages complaining about writes or
fsync failing? Does your application use O_DIRECT? If yes, please note that
you need to turn the option performance.strict-o-direct on for write-behind
to honour O_DIRECT.
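
For example, something along these lines should turn the main performance
xlators off (exact option names can vary a little between releases):

    gluster volume set <volname> performance.write-behind off
    gluster volume set <volname> performance.stat-prefetch off
    gluster volume set <volname> performance.quick-read off
    gluster volume set <volname> performance.io-cache off
    gluster volume set <volname> performance.read-ahead off
    gluster volume set <volname> performance.open-behind off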

Also, is it possible to identify the nature of the corruption - data or
metadata? A more detailed explanation will help us RCA the issue.

Also, is your application running on a single mount or from multiple
mounts? Can you collect an strace of your application (strace -ff -T -p
<pid> -o <file>)? If possible, can you also collect a fuse dump using the
option --dump-fuse while mounting glusterfs?
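
For example (substitute your own pid, paths and volume details; the
--dump-fuse form below assumes you start the glusterfs client binary
directly rather than through the mount helper):

    strace -ff -T -p <pid> -o /tmp/app.strace
    glusterfs --volfile-server=<server> --volfile-id=<volname> \
        --dump-fuse=/tmp/glusterfs-fuse.dump /mnt/<mountpoint>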

[1]
http://lists.gluster.org/pipermail/gluster-users/2018-February/033503.html
Paul Anderson
2018-03-05 21:22:29 UTC
Raghavendra,

Thanks very much for your reply.

I fixed our data corruption problem by disabling the volume
performance.write-behind flag as you suggested, and simultaneously
disabling caching in my client-side mount command.
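
By "disabling caching" I mean client mount options roughly like the
following (we actually mount through the docker volume plugin, so the
exact flags differ, but this is the idea):

    mount -t glusterfs -o attribute-timeout=0,entry-timeout=0,negative-timeout=0 \
        172.18.0.4:/dockerstore /mnt/dockerstore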

In very modest testing, the flock() case appears to work well - before,
it would corrupt the db within a few transactions.

Testing using SQLite3's built-in locks (fcntl range locks) is better,
but has some behavioral issues (probably just requiring a query retry
when the file is locked). I'll research this more, although that test
case is not critical to our use case.
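
If we do pursue it, the retry would look something like this (purely
illustrative; this isn't in our current test script):

    <?php
    // Let SQLite itself wait on a locked database, then retry a few more
    // times ourselves before giving up.
    $db = new SQLite3('/mnt/dockerstore/test.db');
    $db->busyTimeout(5000);        // SQLite retries internally for up to 5s
    for ($attempt = 0; $attempt < 10; $attempt++) {
        if (@$db->exec('UPDATE t SET n = n + 1 WHERE id = 1')) {
            break;                 // the write went through
        }
        usleep(100000);            // back off 100ms on SQLITE_BUSY/LOCKED
    }
    $db->close();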

There are no signs of O_DIRECT use in the sqlite3 code that I can see.

I intend to set up tests that run much longer than a few minutes, to
see if there are any longer term issues. Also, I want to experiment
with data durability by killing various gluster server nodes during
the tests.

If anyone would like our test scripts, I can either tar them up and
email them or put them in github - either is fine with me. (they rely
on current builds of docker and docker-compose)

Thanks again!!

Paul

Amar Tumballi
2018-03-06 00:19:23 UTC
Post by Paul Anderson
If anyone would like our test scripts, I can either tar them up and
email them or put them in github - either is fine with me. (they rely
on current builds of docker and docker-compose)
Sure, sharing the test cases makes it very easy for us to see what the
issue might be. I would recommend a GitHub repo for the scripts.

Regards,
Amar
Joe Julian
2018-03-06 01:09:19 UTC
Tough to do. Like in my case where you would have to install and use Plex.
Post by Amar Tumballi
Sure, sharing the test cases makes it very easy for us to see what the
issue might be. I would recommend a GitHub repo for the scripts.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Csaba Henk
2018-03-06 06:14:10 UTC
Post by Amar Tumballi
Sure, sharing the test cases makes it very easy for us to see what the
issue might be. I would recommend a GitHub repo for the scripts.
I'm also curious about the tests.

Csaba
Raghavendra Gowdappa
2018-03-06 03:39:51 UTC
+Csaba.
Post by Paul Anderson
Raghavendra,
Thanks very much for your reply.
I fixed our data corruption problem by disabling the volume
performance.write-behind flag as you suggested, and simultaneously
disabling caching in my client side mount command.
Good to know it worked. Can you give us the output of
# gluster volume info

We would like to debug the problem in write-behind. Some questions:

1. What version of Glusterfs are you using?
2. Were you able to figure out whether it's stale data or metadata that is
causing the issue?

There have been patches merged in write-behind in the recent past, and one
in the works, which address metadata consistency. We would like to
understand whether you've run into any of the already identified issues.

regards,
Raghavendra
Raghavendra Gowdappa
2018-03-06 03:40:38 UTC
Adding Csaba.
Paul Anderson
2018-03-06 16:52:46 UTC
Raghavendra,

I've committed my test case to https://github.com/powool/gluster.git -
it's grungy, and a work in progress, but I am happy to take change
suggestions, especially if it will save folks significant time.

For the rest, I'll reply inline below...

On Mon, Mar 5, 2018 at 10:39 PM, Raghavendra Gowdappa
Post by Raghavendra Gowdappa
Good to know it worked. Can you give us the output of
# gluster volume info
[***@node-1 /]# gluster volume info

Volume Name: dockerstore
Type: Replicate
Volume ID: fb08b9f4-0784-4534-9ed3-e01ff71a0144
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.18.0.4:/data/glusterfs/store/dockerstore
Brick2: 172.18.0.3:/data/glusterfs/store/dockerstore
Brick3: 172.18.0.2:/data/glusterfs/store/dockerstore
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
locks.mandatory-locking: optimal
performance.flush-behind: off
performance.write-behind: off
Post by Raghavendra Gowdappa
1. What version of Glusterfs are you using?
On the server nodes:

[***@node-1 /]# gluster --version
glusterfs 3.13.2
Repository revision: git://git.gluster.org/glusterfs.git

On the docker container sqlite test node:

***@b4055d8547d2:/# glusterfs --version
glusterfs 3.8.8 built on Jan 11 2017 14:07:11

I recognize that version skew could be an issue.
Post by Raghavendra Gowdappa
2. Were you able to figure out whether it's stale data or metadata that is
causing the issue?
I lean towards stale data based on the only real observation I have:

While debugging, I put log messages in showing when the flock() is
acquired and when it is released. There is no instance where two
different processes ever hold the flock() on the same file at the same
time. From what I have read, the locks are considered metadata, and they
appear to me to be working, so that's why I'm inclined to think stale
data is the issue.
Post by Raghavendra Gowdappa
There have been patches merged in write-behind in the recent past, and one
in the works, which address metadata consistency. We would like to
understand whether you've run into any of the already identified issues.
Agreed!

Thanks,

Paul
Raghavendra Gowdappa
2018-03-06 17:28:40 UTC
Post by Paul Anderson
Post by Raghavendra Gowdappa
1. What version of Glusterfs are you using?
glusterfs 3.13.2
Repository revision: git://git.gluster.org/glusterfs.git
glusterfs 3.8.8 built on Jan 11 2017 14:07:11
I guess this is where the client is mounted. If I am correct about where
the glusterfs client is mounted, the client is running quite an old
version. There have been a significant number of fixes between 3.8.8 and
the current master. I would suggest trying out 3.13.2 patched with [1]. If
you get a chance to try this out, please report back how the tests went.

[1] https://review.gluster.org/19673
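
If it helps, one way to apply it on top of 3.13.2 is roughly the following
(the repository URL and the patch-set number in the refs/changes path are
guesses on my part - please check the review page for the exact values):

    git clone https://review.gluster.org/glusterfs
    cd glusterfs
    git checkout v3.13.2
    git fetch https://review.gluster.org/glusterfs refs/changes/73/19673/<patchset>
    git cherry-pick FETCH_HEAD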
Raghavendra Gowdappa
2018-03-06 17:32:40 UTC
Post by Raghavendra Gowdappa
I guess this is where the client is mounted. If I am correct about where
the glusterfs client is mounted, the client is running quite an old
version. There have been a significant number of fixes between 3.8.8 and
the current master.
... significant number of fixes to write-behind...

Post by Raghavendra Gowdappa
I would suggest trying out 3.13.2 patched with [1]. If you get a chance to
try this out, please report back how the tests went.
I would suggest trying out 3.13.2 patched with [1] and running the tests
with write-behind turned on.
Paul Anderson
2018-03-08 15:00:40 UTC
I was able to get the docker containers I'm using for testing to
install the latest builds from gluster.org.

So client/server versions are both 3.13.2.

I am testing two main cases, both using sqlite3. With a PHP program
wrapping all database operations with a flock(), it now works as
expected. I ran the same test 500 times (or so) yesterday afternoon,
and it worked every time.
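
For the curious, the loop was nothing fancier than something like this
(the script name is illustrative):

    for i in $(seq 500); do php sqlite-flock-test.php || break; done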

I repeated that same test both with and without
performance.flush-behind/write-behind enabled with the same result.

So that's great!

When I ran my other test case, just allowing sqlite3 fcntl() style
locks to manage data, the test failed with either performance setting.

So it could be that sqlite3 is not correctly managing its lock and
flush operations, or it is possible gluster has a data integrity
problem in the case when fcntl() style locks are used. I have no way
of knowing which is more likely...

I think I've got what I need, so someone else is going to need to pick
up the ball if they want a sqlite3 lock to work on its own with
gluster. I will say that it is slow if a bunch of writers are trying
to update individual records at the same time, since the database is
ping-ponging all over the cluster as different clients get and hold
the lock.

I've updated my github repo with my latest changes if anyone feels
like trying it on their own: https://github.com/powool/gluster.git

My summary is: sqlite3's built-in locks don't appear to work nicely with
gluster, so you have to put a flock() around the database operations
to prevent data loss. You also can't do any caching in your volume
mount on the client side. The server-side performance settings appear
not to matter, provided you're up to date on client/server code.

I hope this helps someone!

Paul
