Discussion:
[Gluster-users] Atomic file updates
Tom Munro Glass
2014-02-12 21:02:01 UTC
Permalink
I'm not currently a Gluster user but I'm hoping it's the answer to a
problem I'm working on.

I manage a private web site that is basically a reporting tool for
equipment located at several hundred sites. Each site regularly uploads
zipped XML files to a cloud based server and this also provides a web
interface to the data using apache/PHP. The problem I need to solve is
that with a single server, disk I/O has become a bottleneck.

The plan is to use a load balancer and multiple web servers, with a
4-node Gluster volume behind them to store the data. Data would be
replicated over 2 nodes.

The uploaded files are stored and then unzipped ready for reading by the
web interface code. Each file is unzipped into a temporary file and then
renamed, e.g.

file1.xml.zip --unzip--> uniquename.tmp --rename--> file1.xml

Use of the rename function makes these updates atomic.
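That sequence can be sketched in Python (a minimal sketch; the paths and helper name are hypothetical). The rename is the atomic step: readers see either the old file or the complete new one, never a partial write.

```python
import os
import tempfile
import zipfile

def unzip_atomically(zip_path, member, dest_path):
    """Extract one member of a zip archive and publish it atomically."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    with zipfile.ZipFile(zip_path) as zf:
        data = zf.read(member)
    # Write to a uniquely named temp file in the SAME directory, so the
    # final rename stays within one filesystem and remains atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure the bytes are durable first
        os.replace(tmp_path, dest_path)  # the atomic rename step
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Whether the rename preserves this atomicity end to end on a Gluster volume is exactly the question posed below.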

How can I achieve atomic updates in this way using a Gluster volume? My
understanding is that renaming a file on a Gluster volume causes a link
file to be created and that clearly wouldn't be appropriate where there
are frequent updates.

I could use flock, exclusive for writing and shared for reading, but too
many reading processes could potentially block writing.
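For comparison, the flock scheme just described could look like this in Python (a sketch using fcntl; the function names are hypothetical, and whether flock behaves identically on a Gluster FUSE mount should be verified):

```python
import fcntl

def write_locked(path, data):
    # Exclusive lock: blocks until no reader or writer holds the lock.
    with open(path, "a+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        f.truncate(0)
        f.write(data)
        # lock is released when the file is closed

def read_locked(path):
    # Shared lock: many readers may hold it at once, but a steady
    # stream of readers can starve a waiting writer.
    with open(path, "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        return f.read()
```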

Any advice will be much appreciated.

Tom
Jay Vyas
2014-02-12 21:24:59 UTC
For vanilla apps that are doing stuff in Gluster, you normally do it
through a FUSE mount.

mount -t glusterfs localhost:HadoopVol /mnt/glusterfs

But in your case, you might want to do some strict consistency settings to
make it atomic:

mount -t glusterfs localhost:HadoopVol -o entry-timeout=0,attribute-timeout=0 /mnt/glusterfs

This will make sure that everything is refreshed when you look up files.
This strategy has solved our eventual consistency requirements for the
hadoop plugin.
Tom Munro Glass
2014-02-12 22:23:06 UTC
Post by Jay Vyas
For vanilla apps that are doing stuff in gluster, you normally do it
through a fuse mount.
mount -t glusterfs localhost:HadoopVol /mnt/glusterfs
But in your case, you might want to do some strict consistency settings to
mount -t glusterfs localhost:HadoopVol -o entry-timeout=0,attribute-timeout=0 /mnt/glusterfs
This will make sure that everything is refreshed when you look up files.
This strategy has solved our eventual consistency requirements for the
hadoop plugin.
Are you saying that with these mount options I can just write files
directly without using flock or renaming a temporary file, and that
other processes trying to read the file will always see a complete and
consistent view of the file?

Tom
Jay Vyas
2014-02-12 23:20:03 UTC
Hola Jeff: I'm not sure whether your volfile command is complementary,
or an alternative, to my simple and easy "mount with entry-timeout=0" option.

Tom: I'm not sure; let's wait for Jeff, he's the hardcore Gluster
consistency expert.

I'm just a user :)
Post by Tom Munro Glass
Post by Jay Vyas
For vanilla apps that are doing stuff in gluster, you normally do it
through a fuse mount.
mount -t glusterfs localhost:HadoopVol /mnt/glusterfs
But in your case, you might want to do some strict consistency settings to
make it atomic:
mount -t glusterfs localhost:HadoopVol -o entry-timeout=0,attribute-timeout=0 /mnt/glusterfs
This will make sure that everything is refreshed when you look up files.
This strategy has solved our eventual consistency requirements for the
hadoop plugin.
Are you saying that with these mount options I can just write files
directly without using flock or renaming a temporary file, and that
other processes trying to read the file will always see a complete and
consistent view of the file?
Tom
--
Jay Vyas
http://jayunit100.blogspot.com
Jeff Darcy
2014-02-12 23:56:37 UTC
Post by Tom Munro Glass
Are you saying that with these mount options I can just write files
directly without using flock or renaming a temporary file, and that
other processes trying to read the file will always see a complete and
consistent view of the file?
For write-once files, the rename is really the key to ensuring that
readers never see an incomplete file. If you ever rewrite a file in
place, you'll need flock to avoid reading a partially updated (i.e.
inconsistent) file. Jay's suggestions might also be helpful even
though they both have to do with metadata, because we use attributes
to determine when it's necessary to re-read a file that might have
changed. It's kind of up to you to determine which combination is
needed to meet your own consistency goals with your own workload.
Jeff Darcy
2014-02-12 22:19:52 UTC
Post by Tom Munro Glass
I'm not currently a Gluster user but I'm hoping it's the answer to a
problem I'm working on.
I manage a private web site that is basically a reporting tool for
equipment located at several hundred sites. Each site regularly uploads
zipped XML files to a cloud based server and this also provides a web
interface to the data using apache/PHP. The problem I need to solve is
that with a single server disk I/O has become a bottleneck.
The plan is to use a load balancer and multiple web servers with a
4-node Gluster volume behind to store the data. Data would be replicated
over 2 nodes.
The uploaded files are stored and then unzipped ready for reading by the
web interface code. Each file is unzipped into a temporary file and then
renamed, e.g.
file1.xml.zip --unzip--> uniquename.tmp --rename--> file1.xml
Use of the rename function makes these updates atomic.
How can I achieve atomic updates in this way using a Gluster volume? My
understanding is that renaming a file on a Gluster volume causes a link
file to be created and that clearly wouldn't be appropriate where there
are frequent updates.
Creating a file with one name and then renaming it to another *might*
cause creation of linkfiles, but I think concerns about linkfiles are
often overblown. The one extra call to create a linkfile isn't much
compared to those for creating the file, writing into it, and then
renaming it even if the rename is local to one brick. What really
matters is the performance of the entire sequence, with or without the
linkfile.

That said, there's also a trick you can use to avoid creation of a
linkfile. Other tools, such as rsync and our own object interface,
use the same write-then-rename idiom. To serve them, there's an
option called extra-hash-regex that can be used to place files on the
"right" brick according to their final name even though they're created
with another. Unfortunately, specifying that option via the command line
doesn't seem to work (it creates a malformed volfile) so you have to
mount a bit differently. For example:

glusterfs --volfile-server=a_server --volfile-id=a_volume \
--xlator-option a_volume-dht.extra_hash_regex='(.*+)tmp' \
/a/mountpoint

The important part is that second line. It causes any file with a
"tmp" suffix to be hashed and placed as though only the first
parenthesized part of the regex (i.e. the name without the "tmp") were
there. Therefore, creating "xxxtmp" and then renaming it to "xxx" is
the same as just creating "xxx" in the first place as far as linkfiles
etc. are concerned. Note that the excluded part can be anything that
a regex can match, including a unique random number. If I recall,
rsync uses temp files something like this:

fubar = .fubar.NNNNNN (where NNNNNN is a random number)

I know this probably seems a little voodoo-ish, but with a little bit
of experimentation to find the right regex you should be able to avoid
those dreaded linkfiles altogether.
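The idea can be illustrated with an ordinary regex engine (a Python sketch; note that Python's re module rejects the exact pattern `(.*+)tmp`, so `(.*)tmp` is used here, and the real matching happens inside Gluster's DHT translator, whose regex dialect may differ):

```python
import re

# Illustration of the idea behind extra_hash_regex: DHT hashes only
# the captured group, i.e. the name with its temporary suffix removed,
# so the temp file lands on the same brick as its final name would.
TMP_PATTERN = re.compile(r"(.*)tmp$")

def name_used_for_hashing(filename):
    m = TMP_PATTERN.match(filename)
    # Non-matching names (no "tmp" suffix) are hashed unchanged.
    return m.group(1) if m else filename
```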
Tom Munro Glass
2014-02-13 01:38:34 UTC
Hi Jeff - many thanks for your explanation.
Post by Jeff Darcy
Creating a file with one name and then renaming it to another *might*
cause creation of linkfiles, but I think concerns about linkfiles are
often overblown. The one extra call to create a linkfile isn't much
compared to those for creating the file, writing into it, and then
renaming it even if the rename is local to one brick. What really
matters is the performance of the entire sequence, with or without the
linkfile.
It's not the time overhead of creating a link file that I'm worried
about - it's making sure that I don't end up with millions of orphaned
link files, or link files pointing at other link files. I think the next
part of your message avoids this problem.
Post by Jeff Darcy
That said, there's also a trick you can use to avoid creation of a
linkfile. Other tools, such as rsync and our own object interface,
use the same write-then-rename idiom. To serve them, there's an
option called extra-hash-regex that can be used to place files on the
"right" brick according to their final name even though they're created
with another. Unfortunately, specifying that option via the command line
doesn't seem to work (it creates a malformed volfile) so you have to
glusterfs --volfile-server=a_server --volfile-id=a_volume \
--xlator-option a_volume-dht.extra_hash_regex='(.*+)tmp' \
/a/mountpoint
The important part is that second line. That causes any file with a
"tmp" suffix to be hashed and placed as though only the part in the
first parenthesized part of the regex (i.e. without the "tmp") was
there. Therefore, creating "xxxtmp" and then renaming it to "xxx" is
the same as just creating "xxx" in the first place as far as linkfiles
etc. are concerned. Note that the excluded part can be anything that
a regex can match, including a unique random number. If I recall,
rsync uses temp files something like this:
fubar = .fubar.NNNNNN (where NNNNNN is a random number)
I know this probably seems a little voodoo-ish, but with a little bit
of experimentation to find the right regex you should be able to avoid
those dreaded linkfiles altogether.
I think I mostly understand this. Assuming I implement the volume on 4
servers with 1 brick each and use replica 2, each file will be stored on
2 nodes. Web server clients mount the volume using the syntax you showed
above then when I need to update a file I should:

--write--> file1.xml.tmp --rename--> file1.xml

extra_hash_regex will cause file1.xml.tmp to be created on the 2 bricks
that file1.xml will end up on, and therefore the rename is atomic and a
link file isn't created. The main difference from what I'm doing now
seems to be that the first part of the temporary file needs to be
identical to the final file, instead of having a unique random name.

Is this correct?

BTW I'll be running this on CentOS 6.5 servers and it looks like the
repo has glusterfs-3.4.0.57rhs. Is this version new enough for this?

Tom
Jeff Darcy
2014-02-13 11:52:01 UTC
Post by Tom Munro Glass
It's not the time overhead of creating a link file that I'm worried
about - it's making sure that I don't end up with millions of orphaned
link files, or link files pointing at other link files.
That shouldn't be a problem, because we manage the link files across
renames, deletes, rebalancing, etc. If an operation causes a linkfile
to become invalid or unnecessary we take care of that ourselves before
we consider the original operation complete.
Post by Tom Munro Glass
The main difference from what I'm doing now
seems to be that the first part of the temporary file needs to be
identical to the final file instead having a unique random name.
The temporary file name needs to be some kind of extension of the
final file name, so that we have that final file name in our hands to
generate the correct hash value (which we then use to place the file).
However, the final file name doesn't have to be at the beginning. It
can just as easily be at the end or even in the middle. The prefix
and/or suffix used to create the temporary file can be either fixed
or random, so long as there's some rule expressible as a regex that
can be used to separate the permanent part from the temporary ones.
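As an illustration of that rule, an rsync-style temp name embeds the final name between a dot prefix and a random numeric suffix, and a single capturing group still recovers the permanent part (a Python sketch; the pattern is a hypothetical example, not Gluster's actual option value):

```python
import re

# rsync-style temp names: ".fubar.123456" -> final name "fubar".
# The leading dot and the numeric suffix are the temporary parts; the
# single capturing group holds the permanent part used for hashing.
RSYNC_STYLE = re.compile(r"^\.(.+)\.[0-9]+$")

def permanent_part(tmp_name):
    m = RSYNC_STYLE.match(tmp_name)
    # Names that don't look like temp files hash as themselves.
    return m.group(1) if m else tmp_name
```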