Discussion:
[Gluster-users] Slow write times to gluster disk
Pat Haley
2017-04-07 18:37:06 UTC
Hi,

We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
duplicator) to write a 4.3 GB file of zeros:

* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s

The gluster disk is 2 bricks joined together, no replication or anything
else. The hardware is (literally) the same:

* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spares

Some additional information and more tests results (after changing the
log level):

glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)



*Create the file to /gdata (gluster)*
[***@mseas-data2 gdata]# dd if=/dev/zero of=/gdata/zero1 bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*

*Create the file to /home (ext4)*
[***@mseas-data2 gdata]# dd if=/dev/zero of=/home/zero1 bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s* - 3 times as fast


*Copy from /gdata to /gdata (gluster to gluster)*
[***@mseas-data2 gdata]# dd if=/gdata/zero1 of=/gdata/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww


*Copy from /gdata to /gdata, 2nd time (gluster to gluster)*
[***@mseas-data2 gdata]# dd if=/gdata/zero1 of=/gdata/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again



*Copy from /home to /home (ext4 to ext4)*
[***@mseas-data2 gdata]# dd if=/home/zero1 of=/home/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s* - 30 times as fast


*Copy from /home to /home (ext4 to ext4)*
[***@mseas-data2 gdata]# dd if=/home/zero1 of=/home/zero3
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast


As a test, can we copy data directly to the xfs mountpoint (/mnt/brick1)
and bypass gluster?
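If so, a rough sketch of what we had in mind is something like the
following (the file name is just a placeholder, and we would only write
throwaway test files directly into the brick):

[***@mseas-data2 ~]# dd if=/dev/zero of=/mnt/brick1/ddtest.zero bs=1M count=1000
[***@mseas-data2 ~]# rm -f /mnt/brick1/ddtest.zero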


Any help you could give us would be appreciated.

Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ravishankar N
2017-04-08 04:58:49 UTC
Hi Pat,

I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts, but you
get the benefit of avoiding a single point of failure: unlike fuse
mounts, if the gluster node containing the gnfs server goes down, all
mounts done using that node will fail. For fuse mounts, you could try
tweaking the write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts,
you can achieve fail-over by using CTDB.
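For example, something along these lines (the volume name is a
placeholder, and the window size is just a starting point you would have
to experiment with):

# gluster volume set <volname> performance.write-behind on
# gluster volume set <volname> performance.write-behind-window-size 4MB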

Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing the
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pat Haley
2017-04-10 19:12:45 UTC
Hi Ravi,

Thanks for the reply. And yes, we are using the gluster native (fuse)
mount. Since this is not my area of expertise, I have a few questions
(mostly clarifications).

Is a factor of 20 slow-down typical when comparing a fuse-mounted
filesystem versus an NFS-mounted filesystem, or should we also be looking
for additional issues? (Note the first dd test described below was run
on the server that hosts the file-systems so no network communication
was involved).

You also mention tweaking the "write-behind xlator settings". Would you
expect better speed improvements from switching the mounting from fuse
to gnfs or from tweaking the settings? Also, are these mutually
exclusive or would there be additional benefits from both switching to
gnfs and tweaking?

My next question is to make sure I'm clear on the comment "if the
gluster node containing the gnfs server goes down, all mounts done using
that node will fail". If you have 2 servers, each hosting 1 brick in the
overall gluster FS, and one server fails, then for gnfs nothing on
either server is visible to other nodes, while under fuse only the files
on the dead server are not visible. Is this what you meant?

Finally, you mention "even for gnfs mounts, you can achieve fail-over by
using CTDB". Do you know if CTDB would have any performance impact
(i.e. in a worst case scenario could adding CTDB to gnfs erase the speed
benefits of going to gnfs in the first place)?

Thanks

Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps,
you could try mounting it via gluster NFS (gnfs) and then see if there
is an improvement in speed. Fuse mounts are slower than gnfs mounts
but you get the benefit of avoiding a single point of failure. Unlike
fuse mounts, if the gluster node containing the gnfs server goes down,
all mounts done using that node will fail). For fuse mounts, you could
try tweaking the write-behind xlator settings to see if it helps. See
the performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts,
you can achieve fail-over by using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ravishankar N
2017-04-11 04:21:21 UTC
Post by Pat Haley
Hi Ravi,
Thanks for the reply. And yes, we are using the gluster native (fuse)
mount. Since this is not my area of expertise I have a few questions
(mostly clarifications)
Is a factor of 20 slow-down typical when compare a fuse-mounted
filesytem versus an NFS-mounted filesystem or should we also be
looking for additional issues? (Note the first dd test described
below was run on the server that hosts the file-systems so no network
communication was involved).
Though both the gluster bricks and the mounts are on the same physical
machine in your setup, the I/O still passes through the different layers
of the kernel/user-space fuse stack, although I don't know if a 20x slow
down on gluster vs. an NFS share is normal. Why don't you try doing a
gluster NFS mount on the machine, run the dd test, and compare it with
the gluster fuse mount results?
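Something roughly like the following should do for the comparison (the
mount point is a placeholder; gluster NFS serves NFSv3, and depending on
your setup you may need the nolock option):

# mount -t nfs -o vers=3,nolock,tcp localhost:/<volname> /mnt/gnfs-test
# dd if=/dev/zero of=/mnt/gnfs-test/zero-gnfs bs=1M count=1000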
Post by Pat Haley
You also mention tweaking " write-behind xlator settings". Would you
expect better speed improvements from switching the mounting from fuse
to gnfs or from tweaking the settings? Also are these mutually
exclusive or would the be additional benefits from both switching to
gfns and tweaking?
You should test these out and find the answers yourself. :-)
Post by Pat Haley
My next question is to make sure I'm clear on the comment " if the
gluster node containing the gnfs server goes down, all mounts done
using that node will fail". If you have 2 servers, each 1 brick in
the over-all gluster FS, and one server fails, then for gnfs nothing
on either server is visible to other nodes while under fuse only the
files on the dead server are not visible. Is this what you meant?
Yes, for gnfs mounts, all I/O from the various mounts goes to the gnfs server
process (on the machine whose IP was used at the time of mounting), which
then sends the I/O to the brick processes. For fuse, the gluster fuse
mount itself talks directly to the bricks.
Post by Pat Haley
Finally, you mention "even for gnfs mounts, you can achieve fail-over
by using CTDB". Do you know if CTDB would have any performance impact
(i.e. in a worst cast scenario could adding CTDB to gnfs erase the
speed benefits of going to gnfs in the first place)?
I don't think it would. You can even achieve load balancing via CTDB to
use different gnfs servers for different clients. But I don't know if
this is needed/helpful in your current setup, where everything (bricks
and clients) seems to be on just one server.

-Ravi
Post by Pat Haley
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps,
you could try mounting it via gluster NFS (gnfs) and then see if
there is an improvement in speed. Fuse mounts are slower than gnfs
mounts but you get the benefit of avoiding a single point of failure.
Unlike fuse mounts, if the gluster node containing the gnfs server
goes down, all mounts done using that node will fail). For fuse
mounts, you could try tweaking the write-behind xlator settings to
see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster volume set
help`. Of course, even for gnfs mounts, you can achieve fail-over by
using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card,
/mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-04-13 22:18:30 UTC
Hi Ravi (and list),

We are planning on testing the NFS route to see what kind of speed-up we
get. A little research led us to the following:

https://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/

Is this the correct path to take to mount 2 xfs volumes as a single gluster
file system volume? If not, what would be a better path?
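For context, the existing setup we want to export is the plain 2-brick
distribute volume, i.e. something that would have been created roughly
along these lines (names are placeholders, just to illustrate the layout;
we would not re-run this):

# gluster volume create <volname> <server>:/mnt/brick1 <server>:/mnt/brick2
# gluster volume start <volname>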


Pat
Post by Ravishankar N
Post by Pat Haley
Hi Ravi,
Thanks for the reply. And yes, we are using the gluster native
(fuse) mount. Since this is not my area of expertise I have a few
questions (mostly clarifications)
Is a factor of 20 slow-down typical when compare a fuse-mounted
filesytem versus an NFS-mounted filesystem or should we also be
looking for additional issues? (Note the first dd test described
below was run on the server that hosts the file-systems so no network
communication was involved).
Though both the gluster bricks and the mounts are on the same physical
machine in your setup, the I/O still passes through different layers
of kernel/user-space fuse stack although I don't know if 20x slow down
on gluster vs NFS share is normal. Why don't you try doing a gluster
NFS mount on the machine and try the dd test and compare it with the
gluster fuse mount results?
Post by Pat Haley
You also mention tweaking " write-behind xlator settings". Would you
expect better speed improvements from switching the mounting from
fuse to gnfs or from tweaking the settings? Also are these mutually
exclusive or would the be additional benefits from both switching to
gfns and tweaking?
You should test these out and find the answers yourself. :-)
Post by Pat Haley
My next question is to make sure I'm clear on the comment " if the
gluster node containing the gnfs server goes down, all mounts done
using that node will fail". If you have 2 servers, each 1 brick in
the over-all gluster FS, and one server fails, then for gnfs nothing
on either server is visible to other nodes while under fuse only the
files on the dead server are not visible. Is this what you meant?
Yes, for gnfs mounts, all I/O from various mounts go to the gnfs
server process (on the machine whose IP was used at the time of
mounting) which then sends the I/O to the brick processes. For fuse,
the gluster fuse mount itself talks directly to the bricks.
Post by Pat Haley
Finally, you mention "even for gnfs mounts, you can achieve fail-over
by using CTDB". Do you know if CTDB would have any performance
impact (i.e. in a worst cast scenario could adding CTDB to gnfs erase
the speed benefits of going to gnfs in the first place)?
I don't think it would. You can even achieve load balancing via CTDB
to use different gnfs servers for different clients. But I don't know
if this is needed/ helpful in your current setup where everything
(bricks and clients) seem to be on just one server.
-Ravi
Post by Pat Haley
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps,
you could try mounting it via gluster NFS (gnfs) and then see if
there is an improvement in speed. Fuse mounts are slower than gnfs
mounts but you get the benefit of avoiding a single point of
failure. Unlike fuse mounts, if the gluster node containing the gnfs
server goes down, all mounts done using that node will fail). For
fuse mounts, you could try tweaking the write-behind xlator settings
to see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster volume set
help`. Of course, even for gnfs mounts, you can achieve fail-over by
using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ravishankar N
2017-04-14 04:57:22 UTC
I'm not sure if the version you are running (glusterfs 3.7.11) works
with NFS-Ganesha, as the link seems to suggest version >=3.8 as a
pre-requisite. Adding Soumya for help. If it is not supported, then you
might have to go the plain gluster NFS (gNFS) way.
Regards,
Ravi
Post by Pat Haley
Hi Ravi (and list),
We are planning on testing the NFS route to see what kind of speed-up
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/
Is this correct path to take to mount 2 xfs volumes as a single
gluster file system volume? If not, what would be a better path?
Pat
Post by Ravishankar N
Post by Pat Haley
Hi Ravi,
Thanks for the reply. And yes, we are using the gluster native
(fuse) mount. Since this is not my area of expertise I have a few
questions (mostly clarifications)
Is a factor of 20 slow-down typical when compare a fuse-mounted
filesytem versus an NFS-mounted filesystem or should we also be
looking for additional issues? (Note the first dd test described
below was run on the server that hosts the file-systems so no
network communication was involved).
Though both the gluster bricks and the mounts are on the same
physical machine in your setup, the I/O still passes through
different layers of kernel/user-space fuse stack although I don't
know if 20x slow down on gluster vs NFS share is normal. Why don't
you try doing a gluster NFS mount on the machine and try the dd test
and compare it with the gluster fuse mount results?
Post by Pat Haley
You also mention tweaking " write-behind xlator settings". Would you
expect better speed improvements from switching the mounting from
fuse to gnfs or from tweaking the settings? Also are these mutually
exclusive or would the be additional benefits from both switching to
gfns and tweaking?
You should test these out and find the answers yourself. :-)
Post by Pat Haley
My next question is to make sure I'm clear on the comment " if the
gluster node containing the gnfs server goes down, all mounts done
using that node will fail". If you have 2 servers, each 1 brick in
the over-all gluster FS, and one server fails, then for gnfs nothing
on either server is visible to other nodes while under fuse only the
files on the dead server are not visible. Is this what you meant?
Yes, for gnfs mounts, all I/O from various mounts go to the gnfs
server process (on the machine whose IP was used at the time of
mounting) which then sends the I/O to the brick processes. For fuse,
the gluster fuse mount itself talks directly to the bricks.
Post by Pat Haley
Finally, you mention "even for gnfs mounts, you can achieve
fail-over by using CTDB". Do you know if CTDB would have any
performance impact (i.e. in a worst cast scenario could adding CTDB
to gnfs erase the speed benefits of going to gnfs in the first place)?
I don't think it would. You can even achieve load balancing via CTDB
to use different gnfs servers for different clients. But I don't know
if this is needed/ helpful in your current setup where everything
(bricks and clients) seem to be on just one server.
-Ravi
Post by Pat Haley
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single point
of failure. Unlike fuse mounts, if the gluster node containing the
gnfs server goes down, all mounts done using that node will fail).
For fuse mounts, you could try tweaking the write-behind xlator
settings to see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster volume set
help`. Of course, even for gnfs mounts, you can achieve fail-over
by using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-04-17 07:18:41 UTC
Post by Ravishankar N
I'm not sure if the version you are running (glusterfs 3.7.11 ) works
with NFS-Ganesha as the link seems to suggest version >=3.8 as a
per-requisite. Adding Soumya for help. If it is not supported, then you
might have to go the plain glusterNFS way.
Even gluster 3.7.x should work with NFS-Ganesha, but the steps to
configure it changed from 3.8 onwards, hence the pre-requisite was added
in the doc. IIUC from your mail below, you would like to try NFS
(preferably gNFS rather than NFS-Ganesha), which may perform better
compared to a fuse mount. In that case, the gNFS server comes up by
default (till release-3.7.x) and there are additional steps needed to
export the volume via gNFS. Let me know if you have any issues accessing
volumes via gNFS.
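Roughly, the steps would be along these lines (volume name and mount
point are placeholders; on 3.7.x the gNFS server should already be
running unless nfs.disable was turned on for the volume):

# gluster volume set <volname> nfs.disable off
# gluster volume status <volname>    (the "NFS Server on localhost" entry should show Online)
# showmount -e <server>
# mount -t nfs -o vers=3 <server>:/<volname> /mnt/gnfs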

Regards,
Soumya
Post by Ravishankar N
Regards,
Ravi
Post by Pat Haley
Hi Ravi (and list),
We are planning on testing the NFS route to see what kind of speed-up
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/
Is this correct path to take to mount 2 xfs volumes as a single
gluster file system volume? If not, what would be a better path?
Pat
Post by Ravishankar N
Post by Pat Haley
Hi Ravi,
Thanks for the reply. And yes, we are using the gluster native
(fuse) mount. Since this is not my area of expertise I have a few
questions (mostly clarifications)
Is a factor of 20 slow-down typical when compare a fuse-mounted
filesytem versus an NFS-mounted filesystem or should we also be
looking for additional issues? (Note the first dd test described
below was run on the server that hosts the file-systems so no
network communication was involved).
Though both the gluster bricks and the mounts are on the same
physical machine in your setup, the I/O still passes through
different layers of kernel/user-space fuse stack although I don't
know if 20x slow down on gluster vs NFS share is normal. Why don't
you try doing a gluster NFS mount on the machine and try the dd test
and compare it with the gluster fuse mount results?
Post by Pat Haley
You also mention tweaking " write-behind xlator settings". Would
you expect better speed improvements from switching the mounting
from fuse to gnfs or from tweaking the settings? Also are these
mutually exclusive or would the be additional benefits from both
switching to gfns and tweaking?
You should test these out and find the answers yourself. :-)
Post by Pat Haley
My next question is to make sure I'm clear on the comment " if the
gluster node containing the gnfs server goes down, all mounts done
using that node will fail". If you have 2 servers, each 1 brick in
the over-all gluster FS, and one server fails, then for gnfs nothing
on either server is visible to other nodes while under fuse only the
files on the dead server are not visible. Is this what you meant?
Yes, for gnfs mounts, all I/O from various mounts go to the gnfs
server process (on the machine whose IP was used at the time of
mounting) which then sends the I/O to the brick processes. For fuse,
the gluster fuse mount itself talks directly to the bricks.
Post by Pat Haley
Finally, you mention "even for gnfs mounts, you can achieve
fail-over by using CTDB". Do you know if CTDB would have any
performance impact (i.e. in a worst cast scenario could adding CTDB
to gnfs erase the speed benefits of going to gnfs in the first place)?
I don't think it would. You can even achieve load balancing via CTDB
to use different gnfs servers for different clients. But I don't know
if this is needed/ helpful in your current setup where everything
(bricks and clients) seem to be on just one server.
-Ravi
Post by Pat Haley
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single point
of failure. Unlike fuse mounts, if the gluster node containing the
gnfs server goes down, all mounts done using that node will fail).
For fuse mounts, you could try tweaking the write-behind xlator
settings to see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster volume set
help`. Of course, even for gnfs mounts, you can achieve fail-over
by using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-04-14 06:50:54 UTC
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts but you get
the benefit of avoiding a single point of failure. Unlike fuse mounts, if
the gluster node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size options
in `gluster volume set help`. Of course, even for gnfs mounts, you can
achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than gNFS
servers?

Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?

You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
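In short, with the volume name as a placeholder:

# gluster volume profile <volname> start
# gluster volume profile <volname> info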
Post by Ravishankar N
Thanks,
Ravi
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
- on NFS disk (/home): 9.5 Gb/s
- on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or anything
- one server with 70 hard disks and a hardware RAID card.
- 4 disks in a RAID-6 group (the NFS disk)
- 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
- 32 disks in another RAID-6 group (/mnt/brick2)
- 2 hot spare
Some additional information and more tests results (after changing the log
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast
gdata]# dd if=/gdata/zero1 of=/gdata/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time (gluster to gluster)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint (/mnt/brick1)
and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Pranith
Ravishankar N
2017-04-14 07:01:41 UTC
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single
point of failure. Unlike fuse mounts, if the gluster node
containing the gnfs server goes down, all mounts done using that
node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs
mounts, you can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than
gNFS servers?
I have heard anecdotal evidence time and again on the ML and IRC, which
is why I wanted to compare it with NFS numbers on his setup.
Post by Ravishankar N
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
Yeah, let's see if the profile info turns up anything interesting.
-Ravi
Post by Ravishankar N
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk
when compared to writing to an NFS disk. Specifically when using
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3
3108 [Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
<http://lists.gluster.org/mailman/listinfo/gluster-users>
_______________________________________________ Gluster-users
http://lists.gluster.org/mailman/listinfo/gluster-users
<http://lists.gluster.org/mailman/listinfo/gluster-users>
--
Pranith
Pat Haley
2017-05-05 14:44:30 UTC
Hi Pranith & Ravi,

A couple of quick questions

We have profile turned on. Are there specific queries we should make
that would help debug our configuration? (The default profile info was
previously sent in
http://lists.gluster.org/pipermail/gluster-users/2017-May/030840.html
but I'm not sure if that is what you were looking for.)

We also started to do a test on serving gluster over NFS. We
rediscovered an issue we previously reported (
http://lists.gluster.org/pipermail/gluster-users/2016-September/028289.html
) in that the NFS-mounted version was ignoring the group write
permissions. What specific information would be useful in debugging this?

Thanks

Pat
Post by Ravishankar N
On Sat, Apr 8, 2017 at 10:28 AM, Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single
point of failure. Unlike fuse mounts, if the gluster node
containing the gnfs server goes down, all mounts done using that
node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs
mounts, you can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than
gNFS servers?
I have heard anecdotal evidence time and again on the ML and IRC,
which is why I wanted to compare it with NFS numbers on his setup.
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
Yeah, Let's see if profile info shows up anything interesting.
-Ravi
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk
when compared to writing to an NFS disk. Specifically when using
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3
3108 [Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
<http://lists.gluster.org/mailman/listinfo/gluster-users>
_______________________________________________ Gluster-users
http://lists.gluster.org/mailman/listinfo/gluster-users
<http://lists.gluster.org/mailman/listinfo/gluster-users>
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-05 14:58:21 UTC
Hi Pat,
Let us concentrate on the performance numbers part for now. We will
look at the permissions issue after this?

As per the profile info, only 2.6% of the workload is writes. There are
too many Lookups.

Would it be possible to get the data for just the dd test you were doing
earlier?
Post by Pat Haley
Hi Pranith & Ravi,
A couple of quick questions
We have profile turned on. Are there specific queries we should make that
would help debug our configuration? (The default profile info was
previously sent in http://lists.gluster.org/pipermail/gluster-users/2017-
May/030840.html but I'm not sure if that is what you were looking for.)
We also started to do a test on serving gluster over NFS. We rediscovered
an issue we previously reported ( http://lists.gluster.org/
pipermail/gluster-users/2016-September/028289.html ) in that the NFS
mounted version was ignoring the group write permissions. What specific
information would be useful in debugging this?
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts but you get
the benefit of avoiding a single point of failure. Unlike fuse mounts, if
the gluster node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts, you
can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than gNFS
servers?
I have heard anecdotal evidence time and again on the ML and IRC, which is
why I wanted to compare it with NFS numbers on his setup.
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
You can follow https://gluster.readthedocs.io/en/latest/Administrator%
20Guide/Monitoring%20Workload/ to get the information.
Yeah, Let's see if profile info shows up anything interesting.
-Ravi
Post by Ravishankar N
Thanks,
Ravi
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
- on NFS disk (/home): 9.5 Gb/s
- on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or anything
- one server with 70 hard disks and a hardware RAID card.
- 4 disks in a RAID-6 group (the NFS disk)
- 32 disks in a RAID-6 group (the max allowed by the card,
/mnt/brick1)
- 32 disks in another RAID-6 group (/mnt/brick2)
- 2 hot spare
Some additional information and more tests results (after changing the
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast
gdata]# dd if=/gdata/zero1 of=/gdata/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time (gluster to gluster)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint (/mnt/brick1)
and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
_______________________________________________ Gluster-users mailing
an/listinfo/gluster-users
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
Pat Haley
2017-05-05 15:12:31 UTC
Hi Pranith,

I presume you are asking for some version of the profile data that just
shows the dd test (or a repeat of the dd test). If yes, how do I
extract just that data?

Thanks

Pat
Post by Pranith Kumar Karampuri
hi Pat,
Let us concentrate on the performance numbers part for now. We
will look at the permissions one after this?
As per the profile info, only 2.6% of the work-load is writes. There
are too many Lookups.
Would it be possible to get the data for just the dd test you were
doing earlier?
Hi Pranith & Ravi,
A couple of quick questions
We have profile turned on. Are there specific queries we should
make that would help debug our configuration? (The default
profile info was previously sent in
http://lists.gluster.org/pipermail/gluster-users/2017-May/030840.html
<http://lists.gluster.org/pipermail/gluster-users/2017-May/030840.html>
but I'm not sure if that is what you were looking for.)
We also started to do a test on serving gluster over NFS. We
rediscovered an issue we previously reported (
http://lists.gluster.org/pipermail/gluster-users/2016-September/028289.html
<http://lists.gluster.org/pipermail/gluster-users/2016-September/028289.html>
) in that the NFS mounted version was ignoring the group write
permissions. What specific information would be useful in
debugging this?
Thanks
Pat
Post by Ravishankar N
On Sat, Apr 8, 2017 at 10:28 AM, Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If
it helps, you could try mounting it via gluster NFS (gnfs)
and then see if there is an improvement in speed. Fuse
mounts are slower than gnfs mounts but you get the benefit
of avoiding a single point of failure. Unlike fuse mounts,
if the gluster node containing the gnfs server goes down,
all mounts done using that node will fail). For fuse mounts,
you could try tweaking the write-behind xlator settings to
see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster
volume set help`. Of course, even for gnfs mounts, you can
achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower
than gNFS servers?
I have heard anecdotal evidence time and again on the ML and IRC,
which is why I wanted to compare it with NFS numbers on his setup.
Pat,
I see that I am late to the thread, but do you happen to
have "profile info" of the workload?
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
<https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/>
to get the information.
Yeah, Let's see if profile info shows up anything interesting.
-Ravi
Thanks,
Ravi
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ravishankar N
2017-05-05 16:47:23 UTC
Permalink
Post by Pat Haley
Hi Pranith,
I presume you are asking for some version of the profile data that
just shows the dd test (or a repeat of the dd test). If yes, how do I
extract just that data?
Yes, that is what he is asking for. Just clear the existing profile info
using `gluster volume profile volname clear` and run the dd test once.
Then when you run profile info again, it should just give you the stats
for the dd test.
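For example, something along these lines should capture it (assuming the
volume is named data-volume and is fuse-mounted at /gdata; adjust the volume
name, mount point and test file to your setup):

gluster volume profile data-volume clear
dd if=/dev/zero of=/gdata/profile-test.out bs=1M count=1000
gluster volume profile data-volume info > profile_dd_test.txt
rm /gdata/profile-test.out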
Pat Haley
2017-05-06 00:11:26 UTC
Permalink
Hi,

We redid the dd tests (this time using conv=sync oflag=sync to avoid
caching questions). The profile results are in

http://mseas.mit.edu/download/phaley/GlusterUsers/profile_gluster_fuse_test
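For concreteness, the dd invocation was roughly of this form (the output file
name here is just an example; block size and count as in the earlier tests):

dd if=/dev/zero of=/gdata/ddtest.out bs=1M count=1000 conv=sync oflag=sync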
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-05-10 14:32:48 UTC
Permalink
Hi,

We finally managed to do the dd tests for an NFS-mounted gluster file
system. The profile results during that test are in

http://mseas.mit.edu/download/phaley/GlusterUsers/profile_gluster_nfs_test

The summary of the dd tests is:

* writing to gluster disk mounted with fuse: 5 Mb/s
* writing to gluster disk mounted with nfs: 200 Mb/s

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-10 15:44:04 UTC
Permalink
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info

Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

I copied this from an old thread from 2016. This is a distribute volume. Did
you change any of the options in between?
Pat Haley
2017-05-10 15:47:17 UTC
Permalink
Here is what I see now:

[***@mseas-data2 ~]# gluster volume info

Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Post by Pranith Kumar Karampuri
Is this the volume info you have?
I copied this from an old thread from 2016. This is a distribute volume. Did
you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-10 15:53:55 UTC
Permalink
Could you let me know the speed without oflag=sync on both the mounts? No
need to collect profiles.
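That is, something like the following on each mount (the two paths are just
examples standing in for the fuse mount point and the gluster NFS mount point):

dd if=/dev/zero of=/gdata/ddtest.out bs=1M count=4096
dd if=/dev/zero of=/gdata-nfs/ddtest.out bs=1M count=4096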
--
Pranith
Pat Haley
2017-05-10 16:05:25 UTC
Permalink
Without the oflag=sync, and with only a single test of each, the FUSE mount is
going faster than NFS:

FUSE:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s


NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-10 16:15:44 UTC
Permalink
Okay, good. At least this validates my doubts. Handling O_SYNC in gluster
NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write
syscall has to be written to disk as part of that syscall, whereas in the
case of NFS there is no concept of open. NFS performs the write through a
handle that says it needs to be a synchronous write, so the write() syscall
is performed first and then an fsync() is performed; a write on an fd opened
with O_SYNC effectively becomes write+fsync. My guess is that when multiple
threads do this write+fsync() operation on the same file, multiple writes
get batched together before being written to disk, so the throughput on the
disk increases.
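A rough way to see the two code paths from the client side is with dd itself
(this is only an analogy, not exactly what gNFS does internally; the path and
file name below are placeholders):

dd if=/dev/zero of=/gdata/synctest.out bs=1M count=1000 oflag=sync  # output file opened with O_SYNC: each write must reach disk before dd continues
dd if=/dev/zero of=/gdata/synctest.out bs=1M count=1000 conv=fsync  # buffered writes, with a single fsync before dd exits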

Does it answer your doubts?
--
Pranith
Pat Haley
2017-05-10 16:45:04 UTC
Permalink
Hi Pranith,

Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.

I am also uncertain about how to interpret the results when we also add
the dd tests writing to the /home area (no gluster, still on the same
machine)

* dd test without oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s

Given that the non-gluster area is a RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of 32 disks, I would naively expect the
writes to the gluster area to be roughly 8x faster than to the non-gluster area.

I still think we have a speed issue; I can't tell if fuse vs nfs is part
of the problem. Was there anything useful in the profiles?

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-10 17:27:46 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your answer
by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also add
the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each brick of
the gluster area is a RAID-6 of 32 disks, I would naively expect the writes
to the gluster area to be roughly 8x faster than to the non-gluster.
I think a better test is to try to write a file using nfs without any
gluster, to a location that is not inside the brick but some other location
that is on the same disk(s). If you are mounting the partition as the brick,
then we can write to a file inside the .glusterfs directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
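For example, something like this run against the brick filesystem (taking
/mnt/brick1 from your volume info; the file name is just an example, and the
file should be removed afterwards):

dd if=/dev/zero of=/mnt/brick1/.glusterfs/ddtest-remove-me bs=1M count=4096
rm /mnt/brick1/.glusterfs/ddtest-remove-me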
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I read that fuse is slower than nfs,
which is counter-intuitive to my understanding, so I wanted clarification.
Now that I have that clarification (fuse outperformed nfs without sync), we
can resume testing as described above and try to find what the problem is.
Based on your email id I am guessing you are in Boston and I am in Bangalore,
so if you are okay with doing this debugging over multiple days because of
the timezones, I will be happy to help. Please be a bit patient with me; I am
under a release crunch, but I am very curious about the problem you posted.

Was there anything useful in the profiles?
Unfortunately the profiles didn't help me much. I think we are collecting the
profiles from an active volume, so they contain a lot of information that
does not pertain to dd, which makes it difficult to isolate dd's
contribution. So I went through your post again and found something I hadn't
paid much attention to earlier, i.e. oflag=sync, did my own tests on my setup
with FUSE, and sent that reply.
--
Pranith
Pat Haley
2017-05-10 21:18:26 UTC
Permalink
Hi Pranith,

Since we are mounting the partitions as the bricks, I tried the dd test
writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The
results without oflag=sync were 1.6 Gb/s (faster than gluster but not as
fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-11 11:05:41 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test
writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The
results without oflag=sync were 1.6 Gb/s (faster than gluster but not as
fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer
disks).
Okay, then 1.6 Gb/s is what we need to target, considering your volume
is just distribute. Is there any way you can do tests on similar hardware
but at a small scale, just so we can run the workload and learn more about
the bottlenecks in the system? We can probably try to get the speed to the
1.2 Gb/s of the /home partition you were telling me about yesterday. Let me
know if that is something you are okay to do.
--
Pranith
Pat Haley
2017-05-11 15:27:44 UTC
Permalink
Hi Pranith,

Unfortunately, we don't have similar hardware for a small scale test.
All we have is our production hardware.

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-11 15:32:16 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test. All
we have is our production hardware.
You said something about the /home partition, which has fewer disks; we can
create a plain distribute volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
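As a rough sketch of what I mean (the volume name, brick directory and mount
point below are only placeholders, pick whatever suits you; gluster may ask
for 'force' at the end of the create command depending on where the brick
directory sits):

mkdir -p /home/gluster-test/brick1          # temporary brick directory on the /home partition
gluster volume create home-test mseas-data2:/home/gluster-test/brick1
gluster volume start home-test
mkdir -p /mnt/home-test
mount -t glusterfs mseas-data2:/home-test /mnt/home-test   # fuse mount to run the dd tests against

and once we are done:

umount /mnt/home-test
gluster volume stop home-test
gluster volume delete home-test
rm -rf /home/gluster-test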
--
Pranith
Pat Haley
2017-05-11 16:02:38 UTC
Permalink
Hi Pranith,

The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2

The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
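In case it is useful, the same information can be read off the live mounts
(paths are just the ones from the fstab lines above):

df -T /home /mnt/brick1 /mnt/brick2
grep -E 'home|brick' /proc/mounts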

Will this cause a problem with creating a volume under /home?

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-11 16:06:14 UTC
Permalink
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is the disk. Can you run the same tests you did
earlier on your new volume to confirm?
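For example, once the test volume from the last mail is mounted (names below
are the same placeholders as before):

dd if=/dev/zero of=/mnt/home-test/zeros.txt bs=1048576 count=4096
dd if=/dev/zero of=/mnt/home-test/zeros2.txt bs=1048576 count=4096 oflag=sync

and then the same two runs against a plain directory on /home, so we can
compare the gluster-on-/home numbers with the raw /home numbers.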
--
Pranith
Pat Haley
2017-05-12 14:34:04 UTC
Permalink
Hi Pranith,

My question was about setting up a gluster volume on an ext4 partition.
I thought we had the bricks mounted as xfs for compatibility with gluster?

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-13 03:14:12 UTC
Permalink
Post by Pat Haley
Hi Pranith,
My question was about setting up a gluster volume on an ext4 partition. I
thought we had the bricks mounted as xfs for compatibility with gluster?
Oh that should not be a problem. It works fine.
--
Pranith
Pranith Kumar Karampuri
2017-05-13 03:17:11 UTC
Permalink
On Sat, May 13, 2017 at 8:44 AM, Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Pat Haley
Hi Pranith,
My question was about setting up a gluster volume on an ext4 partition.
I thought we had the bricks mounted as xfs for compatibility with gluster?
Oh that should not be a problem. It works fine.
Just that xfs doesn't really have limits for anything, whereas ext4 does for
things like hard links etc. (at least the last time I checked :-) ). So it is
better to have xfs.
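For instance, the per-inode hard-link limit each filesystem advertises can be
read with getconf (the exact numbers are whatever your kernel reports for
those mounts):

getconf LINK_MAX /home          # ext4
getconf LINK_MAX /mnt/brick1    # xfs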
--
Pranith
Ben Turner
2017-05-15 01:24:53 UTC
Permalink
----- Original Message -----
Sent: Friday, May 12, 2017 11:17:11 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
On Sat, May 13, 2017 at 8:44 AM, Pranith Kumar Karampuri <
Hi Pranith,
My question was about setting up a gluster volume on an ext4 partition. I
thought we had the bricks mounted as xfs for compatibility with gluster?
Oh that should not be a problem. It works fine.
Just that xfs doesn't really have limits for anything, whereas ext4 does for things
like hard links etc. (at least the last time I checked :-) ). So it is better to
have xfs.
One of the biggest reasons to use XFS, IMHO, is that most of the testing, large-scale deployments (at least the ones I know of), etc. are done using XFS as the backend. While EXT4 should work, I don't think it has the same level of testing as XFS.

-b
Pat
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did on
your new volume to confirm?
Pat
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test. All we
have is our production hardware.
You said something about /home partition which has lesser disks, we can
create plain distribute volume inside one of those directories. After we are
done, we can remove the setup. What do you say?
Pat
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test
writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The
results without oflag=sync were 1.6 Gb/s (faster than gluster but not as
fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer
disks).
Okay, then 1.6Gb/s is what we need to target for, considering your volume is
just distribute. Is there any way you can do tests on similar hardware but
at a small scale? Just so we can run the workload to learn more about the
bottlenecks in the system? We can probably try to get the speed to 1.2Gb/s
on your /home partition you were telling me yesterday. Let me know if that
is something you are okay to do.
Pat
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your answer by
some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also add the
dd tests writing to the /home area (no gluster, still on the same machine)
* dd test without oflag=sync (rough average of multiple tests)
* gluster w/ fuse mount : 570 Mb/s
* gluster w/ nfs mount: 390 Mb/s
* nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
* gluster w/ fuse mount: 5 Mb/s
* gluster w/ nfs mount: 200 Mb/s
* nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each brick of
the gluster area is a RAID-6 of 32 disks, I would naively expect the writes
to the gluster area to be roughly 8x faster than to the non-gluster.
I think a better test is to try and write to a file using nfs without any
gluster to a location that is not inside the brick but someother location
that is on same disk(s). If you are mounting the partition as the brick,
then we can write to a file inside .glusterfs directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't tell if fuse vs nfs is part of
the problem.
I got interested in the post because I read that fuse speed is lesser than
nfs speed which is counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications where fuse outperformed nfs
without sync, we can resume testing as described above and try to find what
it is. Based on your email-id I am guessing you are from Boston and I am
from Bangalore so if you are okay with doing this debugging for multiple
days because of timezones, I will be happy to help. Please be a bit patient
with me, I am under a release crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting the
profiles from an active volume, so it has a lot of information that is not
pertaining to dd so it is difficult to find the contributions of dd. So I
went through your post again and found something I didn't pay much attention
to earlier i.e. oflag=sync, so did my own tests on my setup with FUSE so
sent that reply.
Pat
Okay good. At least this validates my doubts. Handling O_SYNC in gluster NFS
and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write
syscall has to be written to disk as part of the syscall, whereas in the case
of NFS there is no concept of open. NFS performs the write through a handle
saying it needs to be a synchronous write, so the write() syscall is performed
first and then it performs fsync(). So a write on an fd with O_SYNC becomes
write+fsync. I am suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple writes are batched
together before being written to disk, so my guess is that this is why the
throughput on the disk increases.
Does it answer your doubts?
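(A rough way to see the two behaviours with dd itself, the file name being only a
placeholder:

dd if=/dev/zero of=ddtest.tmp bs=1048576 count=1024 oflag=sync   # output opened with O_SYNC, every write must hit disk
dd if=/dev/zero of=ddtest.tmp bs=1048576 count=1024 conv=fsync   # buffered writes, one fsync() before dd exits

oflag=sync pays the synchronous cost on every single write, which matches the very
low fuse numbers above; conv=fsync defers the flush to the end, so it is not an exact
reproduction of the per-write write+fsync that gnfs does, just an illustration of the
difference.)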
Without the oflag=sync and only a single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the mounts? No
need to collect profiles.
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute volume. Did you
change any of the options in between?
--
Pranith
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pat Haley
2017-05-16 15:50:35 UTC
Permalink
Hi Pranith,

Sorry for the delay. I never received your reply (but I did receive
Ben Turner's follow-up to it). So we tried to create a gluster
volume under /home using different variations of

gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp

However we keep getting errors of the form

Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>

Any thoughts on what we're doing wrong?

Also, do you have a list of the tests we should run once we get
this volume created? Given the time-zone difference it might help if we
can run a small battery of tests and post all the results at once, rather
than going test-post, new-test-post... .

Thanks

Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you
did on your new volume to confirm?
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small
scale test. All we have is our production hardware.
You said something about /home partition which has lesser disks,
we can create plain distribute volume inside one of those
directories. After we are done, we can remove the setup. What do
you say?
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I
tried the dd test writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster
than gluster but not as fast as I was expecting given
the 1.2 Gb/s to the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for,
considering your volume is just distribute. Is there any way
you can do tests on similar hardware but at a small scale?
Just so we can run the workload to learn more about the
bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me
yesterday. Let me know if that is something you are okay to do.
Pat
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of
expertise). I'll run your answer by some other
people who are more familiar with this.
I am also uncertain about how to interpret the
results when we also add the dd tests writing to
the /home area (no gluster, still on the same machine)
* dd test without oflag=sync (rough average of
multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of
multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4
disks while each brick of the gluster area is a
RAID-6 of 32 disks, I would naively expect the
writes to the gluster area to be roughly 8x faster
than to the non-gluster.
I think a better test is to try and write to a file
using nfs without any gluster to a location that is not
inside the brick but someother location that is on same
disk(s). If you are mounting the partition as the
brick, then we can write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't tell
if fuse vs nfs is part of the problem.
I got interested in the post because I read that fuse
speed is lesser than nfs speed which is
counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications where
fuse outperformed nfs without sync, we can resume
testing as described above and try to find what it is.
Based on your email-id I am guessing you are from
Boston and I am from Bangalore so if you are okay with
doing this debugging for multiple days because of
timezones, I will be happy to help. Please be a bit
patient with me, I am under a release crunch but I am
very curious with the problem you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we
are collecting the profiles from an active volume, so
it has a lot of information that is not pertaining to
dd so it is difficult to find the contributions of dd.
So I went through your post again and found something I
didn't pay much attention to earlier i.e. oflag=sync,
so did my own tests on my setup with FUSE so sent that
reply.
Pat
Post by Pranith Kumar Karampuri
Okay good. At least this validates my doubts.
Handling O_SYNC in gluster NFS and fuse is a bit
different.
When application opens a file with O_SYNC on fuse
mount then each write syscall has to be written to
disk as part of the syscall where as in case of
NFS, there is no concept of open. NFS performs
write though a handle saying it needs to be a
synchronous write, so write() syscall is performed
first then it performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync. I am
suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple
writes are batched together to be written do disk
so the throughput on the disk is increasing is my
guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM, Pat Haley
Without the oflag=sync and only a single test
mseas-data2(dri_nascar)% dd if=/dev/zero
count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s,
575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s,
376 MB/s
On 05/10/2017 11:53 AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Could you let me know the speed without
oflag=sync on both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17 PM, Pat Haley
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute volume. Did you
change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-17 09:01:04 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive
Ben Turner's follow-up to your reply). So we tried to create a gluster
volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give 'transport tcp' at the beginning, I think. Anyway, transport
tcp is the default, so there is no need to specify it; just remove those two
words from the CLI.
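In other words, something like the following should go through (same bricks you
listed, just without the last two words):

gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
gluster volume start test-volume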
Post by Pat Haley
Also do you have a list of the test we should be running once we get this
volume created? Given the time-zone difference it might help if we can run
a small battery of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis for users, as far as I
remember. In our team there are separate engineers who do these tests; Ben,
who replied earlier, is one such engineer.

Ben,
Have any suggestions?
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did on
your new volume to confirm?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test.
All we have is our production hardware.
You said something about /home partition which has lesser disks, we can
create plain distribute volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test
writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your
volume is just distribute. Is there any way you can do tests on similar
hardware but at a small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let
me know if that is something you are okay to do.
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also
add the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would naively expect
the writes to the gluster area to be roughly 8x faster than to the
non-gluster.
I think a better test is to try and write to a file using nfs without
any gluster to a location that is not inside the brick but someother
location that is on same disk(s). If you are mounting the partition as the
brick, then we can write to a file inside .glusterfs directory, something
like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is
part of the problem.
I got interested in the post because I read that fuse speed is lesser
than nfs speed which is counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications where fuse outperformed
nfs without sync, we can resume testing as described above and try to find
what it is. Based on your email-id I am guessing you are from Boston and I
am from Bangalore so if you are okay with doing this debugging for multiple
days because of timezones, I will be happy to help. Please be a bit patient
with me, I am under a release crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting
the profiles from an active volume, so it has a lot of information that is
not pertaining to dd so it is difficult to find the contributions of dd. So
I went through your post again and found something I didn't pay much
attention to earlier i.e. oflag=sync, so did my own tests on my setup with
FUSE so sent that reply.
Post by Pat Haley
Pat
Okay good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When application opens a file with O_SYNC on fuse mount then each
write syscall has to be written to disk as part of the syscall where as in
case of NFS, there is no concept of open. NFS performs write though a
handle saying it needs to be a synchronous write, so write() syscall is
performed first then it performs fsync(). so an write on an fd with O_SYNC
becomes write+fsync. I am suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple writes are batched
together to be written do disk so the throughput on the disk is increasing
is my guess.
Does it answer your doubts?
Post by Pat Haley
Without the oflag=sync and only a single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the
mounts? No need to collect profiles.
Post by Pranith Kumar Karampuri
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute volume.
Did you change any of the options in between?
--
Pranith
Pat Haley
2017-05-30 15:46:18 UTC
Permalink
Hi Pranith,

Thanks for the tip. We now have the gluster volume mounted under
/home. What tests do you recommend we run?

Thanks

Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did
receive Ben Turner's follow-up to your reply). So we tried to
create a gluster volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways,
transport tcp is the default, so no need to specify so remove those
two words from the CLI.
Also do you have a list of the test we should be running once we
get this volume created? Given the time-zone difference it might
help if we can run a small battery of tests and post the results
rather than test-post-new test-post... .
This is the first time I am doing performance analysis on users as far
as I remember. In our team there are separate engineers who do these
tests. Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests
you did on your new volume to confirm?
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a
small scale test. All we have is our production hardware.
You said something about /home partition which has lesser
disks, we can create plain distribute volume inside one of
those directories. After we are done, we can remove the
setup. What do you say?
Pat
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks,
I tried the dd test writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s
(faster than gluster but not as fast as I was
expecting given the 1.2 Gb/s to the no-gluster area
w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for,
considering your volume is just distribute. Is there
any way you can do tests on similar hardware but at a
small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on your /home
partition you were telling me yesterday. Let me know if
that is something you are okay to do.
Pat
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of
expertise). I'll run your answer by some other
people who are more familiar with this.
I am also uncertain about how to interpret the
results when we also add the dd tests writing
to the /home area (no gluster, still on the
same machine)
* dd test without oflag=sync (rough average
of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of
multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of
4 disks while each brick of the gluster area
is a RAID-6 of 32 disks, I would naively
expect the writes to the gluster area to be
roughly 8x faster than to the non-gluster.
I think a better test is to try and write to a
file using nfs without any gluster to a location
that is not inside the brick but someother
location that is on same disk(s). If you are
mounting the partition as the brick, then we can
write to a file inside .glusterfs directory,
something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't
tell if fuse vs nfs is part of the problem.
I got interested in the post because I read that
fuse speed is lesser than nfs speed which is
counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications
where fuse outperformed nfs without sync, we can
resume testing as described above and try to find
what it is. Based on your email-id I am guessing
you are from Boston and I am from Bangalore so if
you are okay with doing this debugging for
multiple days because of timezones, I will be
happy to help. Please be a bit patient with me, I
am under a release crunch but I am very curious
with the problem you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I
think we are collecting the profiles from an
active volume, so it has a lot of information that
is not pertaining to dd so it is difficult to find
the contributions of dd. So I went through your
post again and found something I didn't pay much
attention to earlier i.e. oflag=sync, so did my
own tests on my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith Kumar
Post by Pranith Kumar Karampuri
Okay good. At least this validates my doubts.
Handling O_SYNC in gluster NFS and fuse is a
bit different.
When application opens a file with O_SYNC on
fuse mount then each write syscall has to be
written to disk as part of the syscall where
as in case of NFS, there is no concept of
open. NFS performs write though a handle
saying it needs to be a synchronous write, so
write() syscall is performed first then it
performs fsync(). so an write on an fd with
O_SYNC becomes write+fsync. I am suspecting
that when multiple threads do this
write+fsync() operation on the same file,
multiple writes are batched together to be
written do disk so the throughput on the disk
is increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM, Pat Haley
Without the oflag=sync and only a single
test of each, the FUSE is going faster
mseas-data2(dri_nascar)% dd if=/dev/zero
count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961
s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero
count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264
s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Could you let me know the speed without
oflag=sync on both the mounts? No need
to collect profiles.
On Wed, May 10, 2017 at 9:17 PM, Pat
info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM, Pranith
Post by Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute volume.
Did you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-30 16:10:56 UTC
Permalink
Let's start with the same 'dd' test we were testing with, to see what the
numbers are. Please provide profile numbers for the same. From there on we
will start tuning the volume to see what we can do.
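For example, something along these lines (test-volume as created earlier; replace
<mount-point> with wherever you mounted it under /home):

gluster volume profile test-volume start
dd if=/dev/zero of=/home/<mount-point>/zeros.txt bs=1048576 count=4096 conv=sync
gluster volume profile test-volume info > profile_testvol_gluster.txt
gluster volume profile test-volume stop

conv=sync and the sizes are the same ones used in the earlier dd runs in this thread.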
Post by Pat Haley
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted under /home.
What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive
Ben Turner's follow-up to your reply). So we tried to create a gluster
volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways, transport
tcp is the default, so no need to specify so remove those two words from
the CLI.
Post by Pat Haley
Also do you have a list of the test we should be running once we get this
volume created? Given the time-zone difference it might help if we can run
a small battery of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on users as far as
I remember. In our team there are separate engineers who do these tests.
Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did
on your new volume to confirm?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test.
All we have is our production hardware.
You said something about /home partition which has lesser disks, we can
create plain distribute volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd
test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your
volume is just distribute. Is there any way you can do tests on similar
hardware but at a small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let
me know if that is something you are okay to do.
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also
add the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would naively expect
the writes to the gluster area to be roughly 8x faster than to the
non-gluster.
I think a better test is to try and write to a file using nfs without
any gluster to a location that is not inside the brick but someother
location that is on same disk(s). If you are mounting the partition as the
brick, then we can write to a file inside .glusterfs directory, something
like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is
part of the problem.
I got interested in the post because I read that fuse speed is lesser
than nfs speed which is counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications where fuse outperformed
nfs without sync, we can resume testing as described above and try to find
what it is. Based on your email-id I am guessing you are from Boston and I
am from Bangalore so if you are okay with doing this debugging for multiple
days because of timezones, I will be happy to help. Please be a bit patient
with me, I am under a release crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting
the profiles from an active volume, so it has a lot of information that is
not pertaining to dd so it is difficult to find the contributions of dd. So
I went through your post again and found something I didn't pay much
attention to earlier i.e. oflag=sync, so did my own tests on my setup with
FUSE so sent that reply.
Post by Pat Haley
Pat
Okay good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When application opens a file with O_SYNC on fuse mount then each
write syscall has to be written to disk as part of the syscall where as in
case of NFS, there is no concept of open. NFS performs write though a
handle saying it needs to be a synchronous write, so write() syscall is
performed first then it performs fsync(). so an write on an fd with O_SYNC
becomes write+fsync. I am suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple writes are batched
together to be written do disk so the throughput on the disk is increasing
is my guess.
Does it answer your doubts?
Post by Pat Haley
Without the oflag=sync and only a single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the
mounts? No need to collect profiles.
Post by Pranith Kumar Karampuri
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute
volume. Did you change any of the options in between?
--
Pranith
Pat Haley
2017-05-30 17:06:51 UTC
Permalink
Hi Pranith,

I ran the same 'dd' test both in the gluster test volume and in the
.glusterfs directory of each brick. The median results (12 dd trials in
each test) are similar to before

* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s

The profile for the gluster test-volume is in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt

Thanks

Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see, what
the numbers are. Please provide profile numbers for the same. From
there on we will start tuning the volume to see what we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted under
/home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I
did receive Ben Turner's follow-up to your reply). So we
tried to create a gluster volume under /home using different
variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways,
transport tcp is the default, so no need to specify so remove
those two words from the CLI.
Also do you have a list of the test we should be running once
we get this volume created? Given the time-zone difference
it might help if we can run a small battery of tests and post
the results rather than test-post-new test-post... .
This is the first time I am doing performance analysis on users
as far as I remember. In our team there are separate engineers
who do these tests. Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same
tests you did on your new volume to confirm?
Pat
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a
small scale test. All we have is our production
hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute volume
inside one of those directories. After we are done, we
can remove the setup. What do you say?
Pat
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as the
bricks, I tried the dd test writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s
(faster than gluster but not as fast as I was
expecting given the 1.2 Gb/s to the no-gluster
area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for,
considering your volume is just distribute. Is
there any way you can do tests on similar hardware
but at a small scale? Just so we can run the
workload to learn more about the bottlenecks in
the system? We can probably try to get the speed
to 1.2Gb/s on your /home partition you were
telling me yesterday. Let me know if that is
something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of
expertise). I'll run your answer by some
other people who are more familiar with this.
I am also uncertain about how to
interpret the results when we also add
the dd tests writing to the /home area
(no gluster, still on the same machine)
* dd test without oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick of the
gluster area is a RAID-6 of 32 disks, I
would naively expect the writes to the
gluster area to be roughly 8x faster than
to the non-gluster.
I think a better test is to try and write to
a file using nfs without any gluster to a
location that is not inside the brick but
someother location that is on same disk(s).
If you are mounting the partition as the
brick, then we can write to a file inside
.glusterfs directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I
can't tell if fuse vs nfs is part of the
problem.
I got interested in the post because I read
that fuse speed is lesser than nfs speed
which is counter-intuitive to my
understanding. So wanted clarifications. Now
that I got my clarifications where fuse
outperformed nfs without sync, we can resume
testing as described above and try to find
what it is. Based on your email-id I am
guessing you are from Boston and I am from
Bangalore so if you are okay with doing this
debugging for multiple days because of
timezones, I will be happy to help. Please be
a bit patient with me, I am under a release
crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I
think we are collecting the profiles from an
active volume, so it has a lot of information
that is not pertaining to dd so it is
difficult to find the contributions of dd. So
I went through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own tests
on my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith Kumar
Post by Pranith Kumar Karampuri
Okay good. At least this validates my
doubts. Handling O_SYNC in gluster NFS
and fuse is a bit different.
When application opens a file with
O_SYNC on fuse mount then each write
syscall has to be written to disk as
part of the syscall where as in case of
NFS, there is no concept of open. NFS
performs write though a handle saying it
needs to be a synchronous write, so
write() syscall is performed first then
it performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync. I am
suspecting that when multiple threads do
this write+fsync() operation on the same
file, multiple writes are batched
together to be written do disk so the
throughput on the disk is increasing is
my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM, Pat
Without the oflag=sync and only a
single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied,
7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero
count=4096 bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied,
11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the speed
without oflag=sync on both the
mounts? No need to collect profiles.
On Wed, May 10, 2017 at 9:17 PM,
volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM, Pranith
Post by Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread
from 2016. This is distribute
volume. Did you change any of
the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-31 01:27:26 UTC
Permalink
Pat,
What is the command you used? As per the following output, it seems
like at least one write operation took 16 seconds, which is really bad.

96.39    1165.10 us    89.00 us    *16487014.00 us*    393212    WRITE
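(For reference, the columns in the gluster volume profile info output are %-latency,
avg latency, min latency, max latency, number of calls and fop, so that row says
WRITE was called 393212 times with an average latency around 1.2 ms but a worst
case of about 16.5 seconds.)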
Post by Pat Haley
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in the
.glusterfs directory of each brick. The median results (12 dd trials in
each test) are similar to before
- gluster test volume: 586.5 MB/s
- bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/
profile_testvol_gluster.txt
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see, what the
numbers are. Please provide profile numbers for the same. From there on we
will start tuning the volume to see what we can do.
Post by Pat Haley
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted under /home.
What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive
Ben Turner's follow-up to your reply). So we tried to create a gluster
volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways,
transport tcp is the default, so no need to specify so remove those two
words from the CLI.
Post by Pat Haley
Also do you have a list of the test we should be running once we get
this volume created? Given the time-zone difference it might help if we
can run a small battery of tests and post the results rather than
test-post-new test-post... .
This is the first time I am doing performance analysis on users as far as
I remember. In our team there are separate engineers who do these tests.
Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did
on your new volume to confirm?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test.
All we have is our production hardware.
You said something about /home partition which has lesser disks, we can
create plain distribute volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd
test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your
volume is just distribute. Is there any way you can do tests on similar
hardware but at a small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let
me know if that is something you are okay to do.
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also
add the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would naively expect
the writes to the gluster area to be roughly 8x faster than to the
non-gluster.
I think a better test is to try and write a file, using nfs without
any gluster, to a location that is not inside the brick but some other
location on the same disk(s). If you are mounting the partition as the
brick, then we can write to a file inside the .glusterfs directory, something
like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
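For example (a sketch, using one of the brick mounts mentioned earlier; the
file name is arbitrary and should be removed afterwards):

dd if=/dev/zero of=/mnt/brick1/.glusterfs/dd-test-delete-me bs=1048576 count=4096
rm /mnt/brick1/.glusterfs/dd-test-delete-me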
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is
part of the problem.
I got interested in the post because I read that fuse speed is less
than nfs speed, which is counter-intuitive to my understanding, so I wanted
clarification. Now that I have that clarification (fuse outperformed
nfs without sync), we can resume testing as described above and try to find
what the problem is. Based on your email-id I am guessing you are in Boston
and I am in Bangalore, so if you are okay with this debugging running over
multiple days because of the timezones, I will be happy to help. Please be a
bit patient with me; I am under a release crunch, but I am very curious about
the problem you posted.
Was there anything useful in the profiles?
Unfortunately the profiles didn't help me much. I think we are collecting
the profiles from an active volume, so they contain a lot of information that
does not pertain to dd, and it is difficult to isolate dd's contribution. So
I went through your post again and found something I hadn't paid much
attention to earlier, i.e. oflag=sync, then did my own tests on my setup with
FUSE and sent that reply.
Post by Pat Haley
Pat
Okay, good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each
write syscall has to be written to disk as part of the syscall, whereas in
the case of NFS there is no concept of open. NFS performs the write through a
handle that says it needs to be a synchronous write, so the write() syscall
is performed first and then an fsync() is performed; a write to an fd opened
with O_SYNC effectively becomes write+fsync. My guess is that when multiple
threads do this write+fsync() operation on the same file, multiple writes get
batched together before being written to disk, so the throughput on the disk
increases.
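To get a rough feel for that difference directly on a brick, one could compare
per-write syncing with buffered writes plus a single fsync at the end (just a
sketch, not exactly what the gNFS server does internally; the target file is
arbitrary):

dd if=/dev/zero of=/mnt/brick1/sync-test bs=1M count=1024 oflag=sync   # every write is synchronous
dd if=/dev/zero of=/mnt/brick1/sync-test bs=1M count=1024 conv=fsync   # buffered writes, one fsync at the end
rm /mnt/brick1/sync-test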
Does it answer your doubts?
Post by Pat Haley
Without the oflag=sync and only a single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the
mounts? No need to collect profiles.
Post by Pranith Kumar Karampuri
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
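(For reference, the two diagnostics options above are the ones that
'gluster volume profile data-volume start' typically turns on; they can also
be set by hand, e.g.:

gluster volume set data-volume diagnostics.latency-measurement on
gluster volume set data-volume diagnostics.count-fop-hits on
)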
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

I copied this from an old thread from 2016. This is a distribute
volume. Did you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
Pat Haley
2017-05-31 01:40:34 UTC
Permalink
Hi Pranith,

The "dd" command was:

dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
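(A side note on dd semantics: conv=sync only pads short input blocks up to the
block size; it does not make the writes synchronous. The synchronous-write
case is the oflag=sync one discussed earlier, e.g.
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt oflag=sync)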

There were 2 instances where dd reported 22 seconds. The output from the
dd tests is in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt

Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds, which is
really bad.
96.39   1165.10 us   89.00 us   16487014.00 us   393212   WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results (12 dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the time-zone
difference it might help if we can run a small battery
of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware
for a small scale test. All we have is our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as
the bricks, I tried the dd test writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6
Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target
for, considering your volume is just
distribute. Is there any way you can do tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run your
answer by some other people who are
more familiar with this.
I am also uncertain about how to
interpret the results when we also
add the dd tests writing to the
/home area (no gluster, still on the
same machine)
* dd test without oflag=sync
(rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of
32 disks, I would naively expect the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without any
gluster to a location that is not inside
the brick but someother location that is
on same disk(s). If you are mounting the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue,
I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I
read that fuse speed is lesser than nfs
speed which is counter-intuitive to my
understanding. So wanted clarifications.
Now that I got my clarifications where
fuse outperformed nfs without sync, we
can resume testing as described above
and try to find what it is. Based on
your email-id I am guessing you are from
Boston and I am from Bangalore so if you
are okay with doing this debugging for
multiple days because of timezones, I
will be happy to help. Please be a bit
patient with me, I am under a release
crunch but I am very curious with the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help me
much, I think we are collecting the
profiles from an active volume, so it
has a lot of information that is not
pertaining to dd so it is difficult to
find the contributions of dd. So I went
through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own
tests on my setup with FUSE so sent that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file with
O_SYNC on fuse mount then each
write syscall has to be written to
disk as part of the syscall where
as in case of NFS, there is no
concept of open. NFS performs write
though a handle saying it needs to
be a synchronous write, so write()
syscall is performed first then it
performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync.
I am suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only
a single test of each, the FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync on
both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
on
nfs.exports-auth-enable: on
WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM,
Post by Pranith Kumar Karampuri
Is this the volume info
you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
c162161e-2a2d-4dac-b015-f31fd89ceb18
on />/nfs.disable: on />/nfs.export-volumes: off /
​I copied this from old
thread from 2016. This is
distribute volume. Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-31 01:54:34 UTC
Permalink
Thanks this is good information.

+Soumya

Soumya,
We are trying to find out why kNFS is performing way better than plain
distribute glusterfs+fuse. What information do you think would help us
compare the operations of kNFS vs gluster+fuse? We already have profile
output from fuse.
Post by Pat Haley
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Pat,
What is the command you used? As per the following output, it seems
like at least one write operation took 16 seconds. Which is really bad.
96.39 1165.10 us 89.00 us *16487014.00 us* 393212 WRITE
Post by Pat Haley
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in the
.glusterfs directory of each brick. The median results (12 dd trials in
each test) are similar to before
- gluster test volume: 586.5 MB/s
- bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see, what the
numbers are. Please provide profile numbers for the same. From there on we
will start tuning the volume to see what we can do.
Post by Pat Haley
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted under
/home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply (but I did
receive Ben Turner's follow-up to your reply). So we tried to create a
gluster volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways,
transport tcp is the default, so no need to specify so remove those two
words from the CLI.
Post by Pat Haley
Also do you have a list of the test we should be running once we get
this volume created? Given the time-zone difference it might help if we
can run a small battery of tests and post the results rather than
test-post-new test-post... .
This is the first time I am doing performance analysis on users as far
as I remember. In our team there are separate engineers who do these tests.
Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did
on your new volume to confirm?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale
test. All we have is our production hardware.
You said something about /home partition which has lesser disks, we
can create plain distribute volume inside one of those directories. After
we are done, we can remove the setup. What do you say?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd
test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your
volume is just distribute. Is there any way you can do tests on similar
hardware but at a small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let
me know if that is something you are okay to do.
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also
add the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would naively expect
the writes to the gluster area to be roughly 8x faster than to the
non-gluster.
I think a better test is to try and write to a file using nfs
without any gluster to a location that is not inside the brick but
someother location that is on same disk(s). If you are mounting the
partition as the brick, then we can write to a file inside .glusterfs
directory, something like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is
part of the problem.
I got interested in the post because I read that fuse speed is
lesser than nfs speed which is counter-intuitive to my understanding. So
wanted clarifications. Now that I got my clarifications where fuse
outperformed nfs without sync, we can resume testing as described above and
try to find what it is. Based on your email-id I am guessing you are from
Boston and I am from Bangalore so if you are okay with doing this debugging
for multiple days because of timezones, I will be happy to help. Please be
a bit patient with me, I am under a release crunch but I am very curious
with the problem you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are
collecting the profiles from an active volume, so it has a lot of
information that is not pertaining to dd so it is difficult to find the
contributions of dd. So I went through your post again and found something
I didn't pay much attention to earlier i.e. oflag=sync, so did my own tests
on my setup with FUSE so sent that reply.
Post by Pat Haley
Pat
Okay good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When application opens a file with O_SYNC on fuse mount then each
write syscall has to be written to disk as part of the syscall where as in
case of NFS, there is no concept of open. NFS performs write though a
handle saying it needs to be a synchronous write, so write() syscall is
performed first then it performs fsync(). so an write on an fd with O_SYNC
becomes write+fsync. I am suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple writes are batched
together to be written do disk so the throughput on the disk is increasing
is my guess.
Does it answer your doubts?
Post by Pat Haley
Without the oflag=sync and only a single test of each, the FUSE is
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the
mounts? No need to collect profiles.
Post by Pranith Kumar Karampuri
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

I copied this from an old thread from 2016. This is a distribute
volume. Did you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
Soumya Koduri
2017-05-31 10:56:26 UTC
Permalink
Post by Pranith Kumar Karampuri
Thanks this is good information.
+Soumya
Soumya,
We are trying to find why kNFS is performing way better than
plain distribute glusterfs+fuse. What information do you think will
benefit us to compare the operations with kNFS vs gluster+fuse? We
already have profile output from fuse.
It could be because all operations done by kNFS are local to the system.
The operations done over the network by the FUSE mount could be greater in
number and more time-consuming than the ones sent by the NFS client. We could
compare and examine the pattern from a tcpdump taken over the fuse mount and
the NFS mount. Also nfsstat [1] may give some clue.
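For example (a rough sketch; file names and ports are illustrative, and the
brick/management ports actually in use can be confirmed with 'gluster volume
status'):

tcpdump -i any -s 0 -w fuse_dd.pcap port 24007 or portrange 49152-49251 &
# run the dd test on the fuse mount, then stop the capture

tcpdump -i any -s 0 -w knfs_dd.pcap port 2049 &
# run the dd test on the kNFS mount, then stop the capture

nfsstat -c   # client-side NFS operation counts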

Sorry, I hadn't followed this thread from the beginning. But is this a
comparison between a single-brick volume and kNFS exporting that brick?
Otherwise it's not a fair comparison if the volume is replicated or
distributed.

Thanks,
Soumya

[1] https://linux.die.net/man/8/nfsstat
Post by Pranith Kumar Karampuri
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from
the dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt>
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it seems like at least one write operation took 16 seconds. Which
is really bad.
96.39 1165.10 us 89.00 us *16487014.00 us* 393212 WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in the .glusterfs directory of each brick. The median results
(12 dd trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see, what the numbers are. Please provide profile numbers for
the same. From there on we will start tuning the volume to
see what we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply (but I did receive Ben Turner's follow-up to
your reply). So we tried to create a gluster volume
under /home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to
specify so remove those two words from the CLI.
Also do you have a list of the test we should be
running once we get this volume created? Given the
time-zone difference it might help if we can run a
small battery of tests and post the results rather
than test-post-new test-post... .
This is the first time I am doing performance analysis
on users as far as I remember. In our team there are
separate engineers who do these tests. Ben who replied
earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4
defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume under /home?
I don't think the bottleneck is disk. You can do
the same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware for a small scale test. All we
have is our production hardware.
You said something about /home partition which
has lesser disks, we can create plain
distribute volume inside one of those
directories. After we are done, we can remove
the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Hi Pranith,
Since we are mounting the partitions
as the bricks, I tried the dd test
writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the
1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to
target for, considering your volume is
just distribute. Is there any way you can
do tests on similar hardware but at a
small scale? Just so we can run the
workload to learn more about the
bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s
on your /home partition you were telling
me yesterday. Let me know if that is
something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your answer by some other people
who are more familiar with this.
I am also uncertain about how to
interpret the results when we
also add the dd tests writing to
the /home area (no gluster,
still on the same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570 Mb/s
390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync
(rough average of multiple
tests)
5 Mb/s
200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area
is a RAID-6 of 4 disks while
each brick of the gluster area
is a RAID-6 of 32 disks, I would
naively expect the writes to the
gluster area to be roughly 8x
faster than to the non-gluster.
I think a better test is to try and
write to a file using nfs without
any gluster to a location that is
not inside the brick but someother
location that is on same disk(s). If
you are mounting the partition as
the brick, then we can write to a
file inside .glusterfs directory,
something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue, I can't tell if fuse vs
nfs is part of the problem.
I got interested in the post because
I read that fuse speed is lesser
than nfs speed which is
counter-intuitive to my
understanding. So wanted
clarifications. Now that I got my
clarifications where fuse
outperformed nfs without sync, we
can resume testing as described
above and try to find what it is.
Based on your email-id I am guessing
you are from Boston and I am from
Bangalore so if you are okay with
doing this debugging for multiple
days because of timezones, I will be
happy to help. Please be a bit
patient with me, I am under a
release crunch but I am very curious
with the problem you posted.
Was there anything useful in
the profiles?
Unfortunately profiles didn't help
me much, I think we are collecting
the profiles from an active volume,
so it has a lot of information that
is not pertaining to dd so it is
difficult to find the contributions
of dd. So I went through your post
again and found something I didn't
pay much attention to earlier i.e.
oflag=sync, so did my own tests on
my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates my doubts. Handling
O_SYNC in gluster NFS and fuse
is a bit different.
When application opens a file
with O_SYNC on fuse mount then
each write syscall has to be
written to disk as part of the
syscall where as in case of
NFS, there is no concept of
open. NFS performs write though
a handle saying it needs to be
a synchronous write, so write()
syscall is performed first then
it performs fsync(). so an
write on an fd with O_SYNC
becomes write+fsync. I am
suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so the throughput on the disk
is increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
Without the oflag=sync and
only a single test of each,
the FUSE is going faster
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on both the mounts? No
need to collect profiles.
On Wed, May 10, 2017 at
9:17 PM, Pat Haley
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44
AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Is this the volume
info you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
c162161e-2a2d-4dac-b015-f31fd89ceb18
on />/nfs.disable: on />/nfs.export-volumes: off /
​I copied this from
old thread from 2016.
This is distribute
volume. Did you
change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
Pat Haley
2017-05-31 14:03:32 UTC
Permalink
Hi Soumya,

For the latest test we set up a test gluster volume consisting of 2
bricks, both residing on the NFS disk (/home). The gluster volume is
neither replicated nor striped. The tests were performed on the server
hosting the disk, so no network was involved.

Additional details of the system are in
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
(note that here the tests are now all being done under the /home disk)

Pat
Post by Soumya Koduri
Post by Pranith Kumar Karampuri
Thanks this is good information.
+Soumya
Soumya,
We are trying to find why kNFS is performing way better than
plain distribute glusterfs+fuse. What information do you think will
benefit us to compare the operations with kNFS vs gluster+fuse? We
already have profile output from fuse.
Could be because all operations done by kNFS are local to the system.
The operations done by FUSE mount over network could be more in number
and time-consuming than the ones sent by NFS-client. We could compare
and examine the pattern from tcpump taken over fuse-mount and
NFS-mount. Also nfsstat [1] may give some clue.
Sorry I hadn't followed this mail from the beginning. But is this
comparison between single brick volume and kNFS exporting that brick?
Otherwise its not a fair comparison if the volume is replicated or
distributed.
Thanks,
Soumya
[1] https://linux.die.net/man/8/nfsstat
Post by Pranith Kumar Karampuri
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from
the dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt>
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it seems like at least one write operation took 16 seconds. Which
is really bad.
96.39 1165.10 us 89.00 us *16487014.00 us*
393212 WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in the .glusterfs directory of each brick. The median results
(12 dd trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see, what the numbers are. Please provide profile numbers for
the same. From there on we will start tuning the volume to
see what we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply (but I did receive Ben Turner's follow-up to
your reply). So we tried to create a gluster volume
under /home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to
specify so remove those two words from the CLI.
Also do you have a list of the test we should be
running once we get this volume created? Given the
time-zone difference it might help if we can run a
small battery of tests and post the results rather
than test-post-new test-post... .
This is the first time I am doing performance analysis
on users as far as I remember. In our team there are
separate engineers who do these tests. Ben who replied
earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4
defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume under /home?
I don't think the bottleneck is disk. You can do
the same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware for a small scale test. All we
have is our production hardware.
You said something about /home partition which
has lesser disks, we can create plain
distribute volume inside one of those
directories. After we are done, we can remove
the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Hi Pranith,
Since we are mounting the partitions
as the bricks, I tried the dd test
writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the
1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to
target for, considering your volume is
just distribute. Is there any way you can
do tests on similar hardware but at a
small scale? Just so we can run the
workload to learn more about the
bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s
on your /home partition you were telling
me yesterday. Let me know if that is
something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your answer by some other people
who are more familiar with this.
I am also uncertain about how to
interpret the results when we
also add the dd tests writing to
the /home area (no gluster,
still on the same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570 Mb/s
390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync
(rough average of multiple
tests)
5 Mb/s
200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area
is a RAID-6 of 4 disks while
each brick of the gluster area
is a RAID-6 of 32 disks, I would
naively expect the writes to the
gluster area to be roughly 8x
faster than to the non-gluster.
I think a better test is to try and
write to a file using nfs without
any gluster to a location that is
not inside the brick but someother
location that is on same disk(s). If
you are mounting the partition as
the brick, then we can write to a
file inside .glusterfs directory,
something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue, I can't tell if fuse vs
nfs is part of the problem.
I got interested in the post because
I read that fuse speed is lesser
than nfs speed which is
counter-intuitive to my
understanding. So wanted
clarifications. Now that I got my
clarifications where fuse
outperformed nfs without sync, we
can resume testing as described
above and try to find what it is.
Based on your email-id I am guessing
you are from Boston and I am from
Bangalore so if you are okay with
doing this debugging for multiple
days because of timezones, I will be
happy to help. Please be a bit
patient with me, I am under a
release crunch but I am very curious
with the problem you posted.
Was there anything useful in
the profiles?
Unfortunately profiles didn't help
me much, I think we are collecting
the profiles from an active volume,
so it has a lot of information that
is not pertaining to dd so it is
difficult to find the contributions
of dd. So I went through your post
again and found something I didn't
pay much attention to earlier i.e.
oflag=sync, so did my own tests on
my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates my doubts. Handling
O_SYNC in gluster NFS and fuse
is a bit different.
When application opens a file
with O_SYNC on fuse mount then
each write syscall has to be
written to disk as part of the
syscall where as in case of
NFS, there is no concept of
open. NFS performs write though
a handle saying it needs to be
a synchronous write, so write()
syscall is performed first then
it performs fsync(). so an
write on an fd with O_SYNC
becomes write+fsync. I am
suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so the throughput on the disk
is increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
Without the oflag=sync and
only a single test of each,
the FUSE is going faster
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on both the mounts? No
need to collect profiles.
On Wed, May 10, 2017 at
9:17 PM, Pat Haley
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44
AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Is this the volume
info you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started />/Number of Bricks: 2
on />/nfs.export-volumes: off /
​I copied this from
old thread from 2016.
This is distribute
volume. Did you
change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat
Center for Ocean
Engineering Phone: (617) 253-6824
Dept. of Mechanical
Engineering Fax: (617) 253-8125
MIT, Room 5-213
http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-05-31 14:37:22 UTC
Permalink
Hi Soumya,

What pattern should we be trying to view with the tcpdump? Is a one-minute
capture of a copy operation sufficient, or are you looking for something else?

Pat
Post by Soumya Koduri
Post by Pranith Kumar Karampuri
Thanks this is good information.
+Soumya
Soumya,
We are trying to find why kNFS is performing way better than
plain distribute glusterfs+fuse. What information do you think will
benefit us to compare the operations with kNFS vs gluster+fuse? We
already have profile output from fuse.
Could be because all operations done by kNFS are local to the system.
The operations done by FUSE mount over network could be more in number
and time-consuming than the ones sent by NFS-client. We could compare
and examine the pattern from tcpump taken over fuse-mount and
NFS-mount. Also nfsstat [1] may give some clue.
Sorry I hadn't followed this mail from the beginning. But is this
comparison between single brick volume and kNFS exporting that brick?
Otherwise its not a fair comparison if the volume is replicated or
distributed.
Thanks,
Soumya
[1] https://linux.die.net/man/8/nfsstat
Post by Pranith Kumar Karampuri
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from
the dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt>
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it seems like at least one write operation took 16 seconds. Which
is really bad.
96.39 1165.10 us 89.00 us *16487014.00 us*
393212 WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in the .glusterfs directory of each brick. The median results
(12 dd trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see, what the numbers are. Please provide profile numbers for
the same. From there on we will start tuning the volume to
see what we can do.
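(For reference, one way to bracket the profile around just the dd run is
the sequence below; the volume name matches the test volume discussed
here and the mount point is a placeholder:

   gluster volume profile test-volume start
   dd if=/dev/zero count=4096 bs=1048576 of=/<test-volume-mount>/zeros.txt conv=sync
   gluster volume profile test-volume info > profile_testvol_gluster.txt
   gluster volume profile test-volume stop

That way the counters mostly reflect the dd itself rather than other
activity on the volume.)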
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply (but I did receive Ben Turner's follow-up to
your reply). So we tried to create a gluster volume
under /home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to
specify so remove those two words from the CLI.
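Concretely, either of these forms should be accepted by the CLI (same
volume name and brick paths as in the failing attempt above):

   gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
   gluster volume create test-volume transport tcp mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2

followed by "gluster volume start test-volume" before mounting it.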
Also do you have a list of the test we should be
running once we get this volume created? Given the
time-zone difference it might help if we can run a
small battery of tests and post the results rather
than test-post-new test-post... .
This is the first time I am doing performance analysis
on users as far as I remember. In our team there are
separate engineers who do these tests. Ben who replied
earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4
defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume under /home?
I don't think the bottleneck is disk. You can do
the same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware for a small scale test. All we
have is our production hardware.
You said something about /home partition which
has lesser disks, we can create plain
distribute volume inside one of those
directories. After we are done, we can remove
the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Hi Pranith,
Since we are mounting the partitions
as the bricks, I tried the dd test
writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the
1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to
target for, considering your volume is
just distribute. Is there any way you can
do tests on similar hardware but at a
small scale? Just so we can run the
workload to learn more about the
bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s
on your /home partition you were telling
me yesterday. Let me know if that is
something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your answer by some other people
who are more familiar with this.
I am also uncertain about how to
interpret the results when we
also add the dd tests writing to
the /home area (no gluster,
still on the same machine)
* dd test without oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount: 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area
is a RAID-6 of 4 disks while
each brick of the gluster area
is a RAID-6 of 32 disks, I would
naively expect the writes to the
gluster area to be roughly 8x
faster than to the non-gluster.
I think a better test is to try and
write to a file using nfs without
any gluster to a location that is
not inside the brick but someother
location that is on same disk(s). If
you are mounting the partition as
the brick, then we can write to a
file inside .glusterfs directory,
something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue, I can't tell if fuse vs
nfs is part of the problem.
I got interested in the post because
I read that fuse speed is lesser
than nfs speed which is
counter-intuitive to my
understanding. So wanted
clarifications. Now that I got my
clarifications where fuse
outperformed nfs without sync, we
can resume testing as described
above and try to find what it is.
Based on your email-id I am guessing
you are from Boston and I am from
Bangalore so if you are okay with
doing this debugging for multiple
days because of timezones, I will be
happy to help. Please be a bit
patient with me, I am under a
release crunch but I am very curious
with the problem you posted.
Was there anything useful in
the profiles?
Unfortunately profiles didn't help
me much, I think we are collecting
the profiles from an active volume,
so it has a lot of information that
is not pertaining to dd so it is
difficult to find the contributions
of dd. So I went through your post
again and found something I didn't
pay much attention to earlier i.e.
oflag=sync, so did my own tests on
my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates my doubts. Handling
O_SYNC in gluster NFS and fuse
is a bit different.
When application opens a file
with O_SYNC on fuse mount then
each write syscall has to be
written to disk as part of the
syscall where as in case of
NFS, there is no concept of
open. NFS performs write though
a handle saying it needs to be
a synchronous write, so write()
syscall is performed first then
it performs fsync(). so an
write on an fd with O_SYNC
becomes write+fsync. I am
suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so the throughput on the disk
is increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
Without the oflag=sync and
only a single test of each,
the FUSE is going faster
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on both the mounts? No
need to collect profiles.
On Wed, May 10, 2017 at
9:17 PM, Pat Haley
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44
AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Is this the volume
info you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started />/Number of Bricks: 2
on />/nfs.export-volumes: off /
​I copied this from
old thread from 2016.
This is distribute
volume. Did you
change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ben Turner
2017-06-02 05:07:28 UTC
Permalink
Are you sure using conv=sync is what you want? I normally use conv=fdatasync; I'll look up the difference between the two and see if it affects your test.
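
Roughly speaking (the path and size below are just placeholders):

   dd if=/dev/zero of=/gdata/ddtest bs=1M count=4096 conv=fdatasync   <- write, then flush file data to disk once at the end
   dd if=/dev/zero of=/gdata/ddtest bs=1M count=4096 conv=sync        <- only pads short input blocks, does NOT force data to disk
   dd if=/dev/zero of=/gdata/ddtest bs=1M count=4096 oflag=sync       <- opens the file O_SYNC, so every write is synchronous

so conv=sync and oflag=sync measure very different things.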


-b

----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us* 393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results (12 dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the time-zone
difference it might help if we can run a small battery
of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware
for a small scale test. All we have is our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as
the bricks, I tried the dd test writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6
Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target
for, considering your volume is just
distribute. Is there any way you can do tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run your
answer by some other people who are
more familiar with this.
I am also uncertain about how to
interpret the results when we also
add the dd tests writing to the
/home area (no gluster, still on the
same machine)
* dd test without oflag=sync
(rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of
32 disks, I would naively expect the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without any
gluster to a location that is not inside
the brick but someother location that is
on same disk(s). If you are mounting the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue,
I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I
read that fuse speed is lesser than nfs
speed which is counter-intuitive to my
understanding. So wanted clarifications.
Now that I got my clarifications where
fuse outperformed nfs without sync, we
can resume testing as described above
and try to find what it is. Based on
your email-id I am guessing you are from
Boston and I am from Bangalore so if you
are okay with doing this debugging for
multiple days because of timezones, I
will be happy to help. Please be a bit
patient with me, I am under a release
crunch but I am very curious with the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help me
much, I think we are collecting the
profiles from an active volume, so it
has a lot of information that is not
pertaining to dd so it is difficult to
find the contributions of dd. So I went
through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own
tests on my setup with FUSE so sent that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file with
O_SYNC on fuse mount then each
write syscall has to be written to
disk as part of the syscall where
as in case of NFS, there is no
concept of open. NFS performs write
though a handle saying it needs to
be a synchronous write, so write()
syscall is performed first then it
performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync.
I am suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only
a single test of each, the FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync on
both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
on
nfs.exports-auth-enable: on
WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM,
Post by Pranith Kumar Karampuri
Is this the volume info
you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume info
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started />/Number
of Bricks: 2
/>/Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on />/nfs.disable: on
/>/nfs.export-volumes: off /
​I copied this from old
thread from 2016. This is
distribute volume. Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-06-12 18:35:41 UTC
Permalink
Hi Guys,

I was wondering what our next steps should be to solve the slow write times.

Recently I was debugging a large code that writes a lot of output at
every time step. When I had it write to our gluster disks, it was
taking over a day to do a single time step, whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to get gluster write times similar to the nfs ones.

Thanks

Pat
Post by Ben Turner
Are you sure using conv=sync is what you want? I normally use conv=fdatasync, I'll look up the difference between the two and see if it affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us* 393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results (12 dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the time-zone
difference it might help if we can run a small battery
of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware
for a small scale test. All we have is our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as
the bricks, I tried the dd test writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6
Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target
for, considering your volume is just
distribute. Is there any way you can do tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run your
answer by some other people who are
more familiar with this.
I am also uncertain about how to
interpret the results when we also
add the dd tests writing to the
/home area (no gluster, still on the
same machine)
* dd test without oflag=sync
(rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of
32 disks, I would naively expect the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without any
gluster to a location that is not inside
the brick but someother location that is
on same disk(s). If you are mounting the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue,
I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I
read that fuse speed is lesser than nfs
speed which is counter-intuitive to my
understanding. So wanted clarifications.
Now that I got my clarifications where
fuse outperformed nfs without sync, we
can resume testing as described above
and try to find what it is. Based on
your email-id I am guessing you are from
Boston and I am from Bangalore so if you
are okay with doing this debugging for
multiple days because of timezones, I
will be happy to help. Please be a bit
patient with me, I am under a release
crunch but I am very curious with the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help me
much, I think we are collecting the
profiles from an active volume, so it
has a lot of information that is not
pertaining to dd so it is difficult to
find the contributions of dd. So I went
through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own
tests on my setup with FUSE so sent that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file with
O_SYNC on fuse mount then each
write syscall has to be written to
disk as part of the syscall where
as in case of NFS, there is no
concept of open. NFS performs write
though a handle saying it needs to
be a synchronous write, so write()
syscall is performed first then it
performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync.
I am suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only
a single test of each, the FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync on
both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
on
nfs.exports-auth-enable: on
WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM,
Post by Pranith Kumar Karampuri
Is this the volume info
you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume info
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started />/Number
of Bricks: 2
/>/Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on />/nfs.disable: on
/>/nfs.export-volumes: off /
​I copied this from old
thread from 2016. This is
distribute volume. Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ben Turner
2017-06-12 20:28:09 UTC
Permalink
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
I can see in your test:

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt

You averaged ~600 MB / sec (expected for replica 2 with 10G: {~1200 MB / sec} / #replicas{2} = 600). Gluster does client-side replication, so with replica 2 you will only ever see 1/2 the speed of the slowest part of your stack (NW, disk, RAM, CPU). This is usually NW or disk, and 600 is normally a best case. Now in your output I do see the instances where you went down to 200 MB / sec. I can only explain this in three ways:

1. You are not using conv=fdatasync and writes are actually going to page cache and then being flushed to disk. During the fsync the memory is not yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNs (like in a SAN) and when write times are slow the RAID group is busy servicing other LUNs.
3. Gluster bug / config issue / some other unknown unknown.

So I see 2 issues here:

1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.

WRT #1 - have a look at my estimates above. My formula for guesstimating gluster perf is: throughput = NIC throughput or storage (whichever is slower) / # replicas * overhead (figure .7 or .8). Also, the larger the record size the better for glusterfs mounts; I normally like to be at LEAST 64k, up to 1024k:

# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
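
To put rough numbers on that formula (assuming a 10G NIC is the slower leg):

   replica 2:        1200 MB/sec / 2 * .7 = ~420 MB/sec best case from one client
   plain distribute: 1200 MB/sec / 1 * .7 = ~840 MB/sec best case from one client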

WRT #2 - Again, I question your testing and your storage config. Try using conv=fdatasync for your DDs, use a larger record size, and make sure that your back end storage is not causing your slowdowns. Also remember that with replica 2 you will take ~50% hit on writes because the client uses 50% of its bandwidth to write to one replica and 50% to the other.

-b
Thanks
Pat
Post by Ben Turner
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us* 393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results (12 dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the time-zone
difference it might help if we can run a small battery
of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware
for a small scale test. All we have is our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as
the bricks, I tried the dd test writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6
Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target
for, considering your volume is just
distribute. Is there any way you can do tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run your
answer by some other people who are
more familiar with this.
I am also uncertain about how to
interpret the results when we also
add the dd tests writing to the
/home area (no gluster, still on the
same machine)
* dd test without oflag=sync
(rough average of multiple tests)
o gluster w/ fuse mount : 570
Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of
32 disks, I would naively expect the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without any
gluster to a location that is not inside
the brick but someother location that is
on same disk(s). If you are mounting the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue,
I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I
read that fuse speed is lesser than nfs
speed which is counter-intuitive to my
understanding. So wanted clarifications.
Now that I got my clarifications where
fuse outperformed nfs without sync, we
can resume testing as described above
and try to find what it is. Based on
your email-id I am guessing you are from
Boston and I am from Bangalore so if you
are okay with doing this debugging for
multiple days because of timezones, I
will be happy to help. Please be a bit
patient with me, I am under a release
crunch but I am very curious with the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help me
much, I think we are collecting the
profiles from an active volume, so it
has a lot of information that is not
pertaining to dd so it is difficult to
find the contributions of dd. So I went
through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own
tests on my setup with FUSE so sent that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file with
O_SYNC on fuse mount then each
write syscall has to be written to
disk as part of the syscall where
as in case of NFS, there is no
concept of open. NFS performs write
though a handle saying it needs to
be a synchronous write, so write()
syscall is performed first then it
performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync.
I am suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only
a single test of each, the FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync on
both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
nfs.exports-auth-enable: on
WARNING
on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM,
Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Is this the volume info
you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume info
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started
/>/Number
of Bricks: 2
/>/Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on />/nfs.disable: on
/>/nfs.export-volumes: off
/
​I copied this from old
thread from 2016. This is
distribute volume. Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-06-20 16:06:30 UTC
Permalink
Hi Ben,

Sorry this took so long, but we had a real-time forecasting exercise
last week and I could only get to this now.

Backend Hardware/OS:

* Much of the information on our back end system is included at the
top of
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
* The specific model of the hard disks is Seagate Enterprise Capacity
V.4 6TB (ST6000NM0024). The rated interface speed is 6 Gb/s.
* Note: there is one physical server that hosts both the NFS and the
GlusterFS areas

Latest tests

I have had time to run one of the dd tests you requested against the
underlying XFS FS. The median rate was 170 MB/s. The dd results
and iostat record are in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/

I'll add tests for the other brick and to the NFS area later.

Thanks

Pat
throughput = slowest of (disks, NIC) * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second, can you refresh me on your workload? Are you doing reads / writes or both? If both, what mix? Since we are using DD I assume you are working with large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Let's see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
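A sketch of one iteration, run once per brick and once on the gluster mount (paths are placeholders):

   iostat -c -m -x 1 > iostat-$(hostname).txt &
   dd if=/dev/zero of=/mnt/brick1/ddtest bs=1024k count=10000 conv=fdatasync
   echo 3 > /proc/sys/vm/drop_caches
   dd if=/mnt/brick1/ddtest of=/dev/null bs=1024k count=10000
   kill %1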
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
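(For what it's worth, the volume reports itself as plain distribute:

   gluster volume info data-volume | grep -E "^Type|Number of Bricks"
   Type: Distribute
   Number of Bricks: 2

so there should be no replication overhead on writes.)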
Thanks
Pat
Post by Ben Turner
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec(expected for replica 2 with 10G, {~1200 MB /
sec} / #replicas{2} = 600). Gluster does client side replication so with
replica 2 you will only ever see 1/2 the speed of your slowest part of
the
stack(NW, disk, RAM, CPU). This is usually NW or disk and 600 is
normally
a best case. Now in your output I do see the instances where you went
1. You are not using conv=fdatasync and writes are actually going to
page
cache and then being flushed to disk. During the fsync the memory is not
yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNS(like in a SAN)
and when write times are slow the RAID group is busy serviceing other
LUNs.
3. Gluster bug / config issue / some other unknown unknown.
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating
gluster perf is: throughput = NIC throughput or storage(whatever is
slower) / # replicas * overhead(figure .7 or .8). Also the larger the
record size the better for glusterfs mounts, I normally like to be at
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
WRT #2 - Again, I question your testing and your storage config. Try
using
conv=fdatasync for your DDs, use a larger record size, and make sure that
your back end storage is not causing your slowdowns. Also remember that
with replica 2 you will take ~50% hit on writes because the client uses
50% of its bandwidth to write to one replica and 50% to the other.
-b
Thanks
Pat
Post by Ben Turner
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us*
393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in
the .glusterfs directory of each brick. The median results
(12
dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see
what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never received your
reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume
under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so there is no need to
specify it; just remove those two words from the CLI.
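(As a sketch, the corrected command would look something like this, using the
same brick paths as in the attempt above:)

gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
gluster volume start test-volume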
Also do you have a list of the test we should be
running
once we get this volume created? Given the
time-zone
difference it might help if we can run a small
battery
of tests and post the results rather than
test-post-new
test-post... .
This is the first time I am doing performance analysis
on
users as far as I remember. In our team there are
separate
engineers who do these tests. Ben who replied earlier is
one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On 05/11/2017 12:06 PM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume
under /home?
I don't think the bottleneck is disk. You can do
the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware
for a small scale test. All we have is
our
production hardware.
You said something about the /home partition which has
fewer disks; we can create a plain distribute
volume inside one of those directories. After
we
are done, we can remove the setup. What do you
say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Haley
Hi Pranith,
Since we are mounting the partitions
as
the bricks, I tried the dd test
writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6
Gb/s (faster than gluster but not as
fast
as I was expecting given the 1.2 Gb/s
to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to
target
for, considering your volume is just
distribute. Is there any way you can do
tests
on similar hardware but at a small scale?
Just so we can run the workload to learn
more
about the bottlenecks in the system? We
can
probably try to get the speed to 1.2Gb/s
on
your /home partition you were telling me
yesterday. Let me know if that is
something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your
answer by some other people who
are
more familiar with this.
I am also uncertain about how to
interpret the results when we
also
add the dd tests writing to the
/home area (no gluster, still on
the
same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570
Mb/s
390
Mb/s
o nfs (no gluster): 1.2
Gb/s
* dd test with oflag=sync
(rough
average of multiple tests)
5
Mb/s
200
Mb/s
o nfs (no gluster): 20
Mb/s
Given that the non-gluster area
is
a
RAID-6 of 4 disks while each
brick
of the gluster area is a RAID-6
of
32 disks, I would naively expect
the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without
any
gluster to a location that is not
inside
the brick but some other location that is
on the same disk(s). If you are mounting
the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
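(A rough sketch of that test -- the brick path is one of ours, the file name is
a placeholder, conv=fdatasync is added here so the rate reflects disk rather
than page cache, and the file is removed afterwards:)

dd if=/dev/zero of=/mnt/brick1/.glusterfs/ddtest.tmp bs=1M count=4096 conv=fdatasync  # raw disk speed, no gluster in the path
rm -f /mnt/brick1/.glusterfs/ddtest.tmp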
I still think we have a speed
issue,
I can't tell if fuse vs nfs is
part
of the problem.
I got interested in the post because
I
read that fuse speed is lesser than
nfs
speed which is counter-intuitive to
my
understanding. So wanted
clarifications.
Now that I got my clarifications
where
fuse outperformed nfs without sync,
we
can resume testing as described
above
and try to find what it is. Based on
your email-id I am guessing you are
from
Boston and I am from Bangalore so if
you
are okay with doing this debugging
for
multiple days because of timezones,
I
will be happy to help. Please be a
bit
patient with me, I am under a
release
crunch but I am very curious with
the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help
me
much, I think we are collecting the
profiles from an active volume, so
it
has a lot of information that is not
pertaining to dd so it is difficult
to
find the contributions of dd. So I
went
through your post again and found
something I didn't pay much
attention
to
earlier i.e. oflag=sync, so did my
own
tests on my setup with FUSE so sent
that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file
with
O_SYNC on fuse mount then each
write syscall has to be written
to
disk as part of the syscall,
whereas in the case of NFS, there is no
concept of open. NFS performs
write
though a handle saying it needs
to
be a synchronous write, so
write()
syscall is performed first then
it
performs fsync(), so a write on an
fd with O_SYNC becomes
write+fsync.
I am suspecting that when
multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written to disk
so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
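(One way to see the user-visible difference, as a sketch -- the mount point is a
placeholder:)

dd if=/dev/zero of=/fuse-mount/synctest bs=1M count=256 oflag=sync       # O_SYNC: every write must reach disk before returning
dd if=/dev/zero of=/fuse-mount/synctest bs=1M count=256 conv=fdatasync   # one flush at the end, after all the writes
rm -f /fuse-mount/synctest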
On Wed, May 10, 2017 at 9:35
PM,
Without the oflag=sync and
only
a single test of each, the
FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on
both the mounts? No need
to
collect profiles.
On Wed, May 10, 2017 at
9:17
PM, Pat Haley
Here is what I see
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
off
On 05/10/2017 11:44
AM,
Pranith Kumar
Karampuri
Post by Pranith Kumar Karampuri
Is this the volume
info
you have?
/[root at
mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
info
Distribute />/Volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started
/>/Number
of Bricks: 2
tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
/>/Options
on />/nfs.disable: on
off
/
​I copied this from
old
thread from 2016.
This
is
distribute volume.
Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-06-22 20:53:42 UTC
Permalink
Hi,

Today we experimented with some of the FUSE options that we found in the
list.

Changing these options had no effect:

gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB

Changing the following option from its default value made the speed slower

gluster volume set test-volume performance.write-behind off (on by default)

Changing the following options initially appeared to give a 10% increase
in speed, but this vanished in subsequent tests (we think the apparent
increase may have been due to a lighter workload on the computer from other
users)

gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4

Can anything be gleaned from these observations? Are there other things
we can try?
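(For the record, a sketch of how these experiments can be rolled back to the
defaults, using the same volume name as above:)

gluster volume reset test-volume performance.cache-max-file-size
gluster volume reset test-volume performance.cache-refresh-timeout
gluster volume reset test-volume performance.cache-size
gluster volume reset test-volume performance.write-behind-window-size
gluster volume reset test-volume performance.write-behind
gluster volume reset test-volume performance.stat-prefetch
gluster volume reset test-volume client.event-threads
gluster volume reset test-volume server.event-threads
gluster volume info test-volume

The last command lists whatever options are still reconfigured away from
their defaults.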

Thanks

Pat
Post by Pat Haley
Hi Ben,
Sorry this took so long, but we had a real-time forecasting exercise
last week and I could only get to this now.
* Much of the information on our back end system is included at the
top of
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
* The specific model of the hard disks is SeaGate ENTERPRISE
CAPACITY V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
* Note: there is one physical server that hosts both the NFS and the
GlusterFS areas
Latest tests
I have had time to run the tests for one of the dd tests you requested
to the underlying XFS FS. The median rate was 170 MB/s. The dd
results and iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and to the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both what mix? Since we are using DD I assume you are working with large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Let's see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
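(Pulling those steps together as a small sketch, run as root on the brick
server -- the mount point and file name are placeholders:)

iostat -c -m -x 1 > iostat-$(hostname).txt &                               # capture disk/CPU stats in the background
dd if=/dev/zero of=/xfs-mount/ddfile bs=1024k count=10000 conv=fdatasync   # sequential write to the back end
echo 3 > /proc/sys/vm/drop_caches                                          # drop the page cache before reading
dd if=/xfs-mount/ddfile of=/dev/null bs=1024k count=10000                  # sequential read back
kill $!                                                                    # stop the iostat capture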
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
Thanks
Pat
Post by Ben Turner
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec(expected for replica 2 with 10G, {~1200 MB /
sec} / #replicas{2} = 600). Gluster does client side replication so with
replica 2 you will only ever see 1/2 the speed of your slowest part of
the
stack(NW, disk, RAM, CPU). This is usually NW or disk and 600 is
normally
a best case. Now in your output I do see the instances where you went
1. You are not using conv=fdatasync and writes are actually going to
page
cache and then being flushed to disk. During the fsync the memory is not
yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNS(like in a SAN)
and when write times are slow the RAID group is busy serviceing other
LUNs.
3. Gluster bug / config issue / some other unknown unknown.
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating
gluster perf is: throughput = NIC throughput or storage(whatever is
slower) / # replicas * overhead(figure .7 or .8). Also the larger the
record size the better for glusterfs mounts, I normally like to be at
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
WRT #2 - Again, I question your testing and your storage config. Try
using
conv=fdatasync for your DDs, use a larger record size, and make sure that
your back end storage is not causing your slowdowns. Also remember that
with replica 2 you will take ~50% hit on writes because the client uses
50% of its bandwidth to write to one replica and 50% to the other.
-b
Thanks
Pat
Post by Ben Turner
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us*
393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in
the .glusterfs directory of each brick. The median results
(12
dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see
what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume
under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to
specify
so remove those two words from the CLI.
Also do you have a list of the test we should be
running
once we get this volume created? Given the
time-zone
difference it might help if we can run a small
battery
of tests and post the results rather than
test-post-new
test-post... .
This is the first time I am doing performance analysis
on
users as far as I remember. In our team there are
separate
engineers who do these tests. Ben who replied earlier is
one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On 05/11/2017 12:06 PM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume
under /home?
I don't think the bottleneck is disk. You can do
the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware
for a small scale test. All we have is
our
production hardware.
You said something about /home partition which
has
lesser disks, we can create plain distribute
volume inside one of those directories. After
we
are done, we can remove the setup. What do you
say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Haley
Hi Pranith,
Since we are mounting the partitions
as
the bricks, I tried the dd test
writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6
Gb/s (faster than gluster but not as
fast
as I was expecting given the 1.2 Gb/s
to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to
target
for, considering your volume is just
distribute. Is there any way you can do
tests
on similar hardware but at a small scale?
Just so we can run the workload to learn
more
about the bottlenecks in the system? We
can
probably try to get the speed to 1.2Gb/s
on
your /home partition you were telling me
yesterday. Let me know if that is
something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your
answer by some other people who
are
more familiar with this.
I am also uncertain about how to
interpret the results when we
also
add the dd tests writing to the
/home area (no gluster, still on
the
same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570
Mb/s
390
Mb/s
o nfs (no gluster): 1.2
Gb/s
* dd test with oflag=sync
(rough
average of multiple tests)
5
Mb/s
200
Mb/s
o nfs (no gluster): 20
Mb/s
Given that the non-gluster area
is
a
RAID-6 of 4 disks while each
brick
of the gluster area is a RAID-6
of
32 disks, I would naively expect
the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without
any
gluster to a location that is not
inside
the brick but someother location
that
is
on same disk(s). If you are mounting
the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue,
I can't tell if fuse vs nfs is
part
of the problem.
I got interested in the post because
I
read that fuse speed is lesser than
nfs
speed which is counter-intuitive to
my
understanding. So wanted
clarifications.
Now that I got my clarifications
where
fuse outperformed nfs without sync,
we
can resume testing as described
above
and try to find what it is. Based on
your email-id I am guessing you are
from
Boston and I am from Bangalore so if
you
are okay with doing this debugging
for
multiple days because of timezones,
I
will be happy to help. Please be a
bit
patient with me, I am under a
release
crunch but I am very curious with
the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help
me
much, I think we are collecting the
profiles from an active volume, so
it
has a lot of information that is not
pertaining to dd so it is difficult
to
find the contributions of dd. So I
went
through your post again and found
something I didn't pay much
attention
to
earlier i.e. oflag=sync, so did my
own
tests on my setup with FUSE so sent
that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file
with
O_SYNC on fuse mount then each
write syscall has to be written
to
disk as part of the syscall
where
as in case of NFS, there is no
concept of open. NFS performs
write
though a handle saying it needs
to
be a synchronous write, so
write()
syscall is performed first then
it
performs fsync(). so an write
on
an
fd with O_SYNC becomes
write+fsync.
I am suspecting that when
multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
PM,
Without the oflag=sync and
only
a single test of each, the
FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on
both the mounts? No need
to
collect profiles.
On Wed, May 10, 2017 at
9:17
PM, Pat Haley
Here is what I see
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
off
On 05/10/2017 11:44
AM,
Pranith Kumar
Karampuri
Post by Pranith Kumar Karampuri
Is this the volume
info
you have?
/[root at
mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
info
Distribute />/Volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started
/>/Number
of Bricks: 2
tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
/>/Options
on />/nfs.disable: on
off
/
​I copied this from
old
thread from 2016.
This
is
distribute volume.
Did
you change any of the
options in between?
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-06-23 03:40:49 UTC
Permalink
Post by Pat Haley
Hi,
Today we experimented with some of the FUSE options that we found in the
list.
gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB
This is a good coincidence, I am meeting with write-behind
maintainer (+Raghavendra G) today for the same doubt. I think we will have
something by EOD IST. I will update you.
Post by Pat Haley
Changing the following option from its default value made the speed slower
gluster volume set test-volume performance.write-behind off (on by default)
Changing the following options initially appeared to give a 10% increase
in speed, but this vanished in subsequent tests (we think the apparent
increase may have been to a lighter workload on the computer from other
users)
gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4
Can anything be gleaned from these observations? Are there other things
we can try?
Thanks
Pat
Hi Ben,
Sorry this took so long, but we had a real-time forecasting exercise last
week and I could only get to this now.
- Much of the information on our back end system is included at the
top of http://lists.gluster.org/pipermail/gluster-users/2017-
April/030529.html
- The specific model of the hard disks is SeaGate ENTERPRISE CAPACITY
V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
- Note: there is one physical server that hosts both the NFS and the
GlusterFS areas
Latest tests
I have had time to run the tests for one of the dd tests you requested to
the underlying XFS FS. The median rate was 170 MB/s. The dd results and
iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and to the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both what mix? Since we are using DD I assume you are working iwth large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Lets see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
Thanks
Pat
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec(expected for replica 2 with 10G, {~1200 MB /
sec} / #replicas{2} = 600). Gluster does client side replication so with
replica 2 you will only ever see 1/2 the speed of your slowest part of
the
stack(NW, disk, RAM, CPU). This is usually NW or disk and 600 is
normally
a best case. Now in your output I do see the instances where you went
1. You are not using conv=fdatasync and writes are actually going to
page
cache and then being flushed to disk. During the fsync the memory is not
yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNS(like in a SAN)
and when write times are slow the RAID group is busy serviceing other
LUNs.
3. Gluster bug / config issue / some other unknown unknown.
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating
gluster perf is: throughput = NIC throughput or storage(whatever is
slower) / # replicas * overhead(figure .7 or .8). Also the larger the
record size the better for glusterfs mounts, I normally like to be at
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000
conv=fdatasync
WRT #2 - Again, I question your testing and your storage config. Try
using
conv=fdatasync for your DDs, use a larger record size, and make sure that
your back end storage is not causing your slowdowns. Also remember that
with replica 2 you will take ~50% hit on writes because the client uses
50% of its bandwidth to write to one replica and 50% to the other.
-b
Thanks
Pat
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us*
393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results
(12
dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt> <http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume
under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the
time-zone
difference it might help if we can run a small
battery
of tests and post the results rather than
test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are
separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On 05/11/2017 12:06 PM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware
for a small scale test. All we have is
our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Haley
Hi Pranith,
Since we are mounting the partitions
as
the bricks, I tried the dd test
writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6
Gb/s (faster than gluster but not as
fast
as I was expecting given the 1.2 Gb/s
to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to
target
for, considering your volume is just
distribute. Is there any way you can do
tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We
can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is
something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your
answer by some other people who
are
more familiar with this.
I am also uncertain about how to
interpret the results when we
also
add the dd tests writing to the
/home area (no gluster, still on
the
same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570
Mb/s
390
Mb/s
o nfs (no gluster): 1.2
Gb/s
* dd test with oflag=sync
(rough
average of multiple tests)
5
Mb/s
200
Mb/s
o nfs (no gluster): 20
Mb/s
Given that the non-gluster area
is
a
RAID-6 of 4 disks while each
brick
of the gluster area is a RAID-6
of
32 disks, I would naively expect
the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without
any
gluster to a location that is not
inside
the brick but someother location
that
is
on same disk(s). If you are mounting
the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue,
I can't tell if fuse vs nfs is
part
of the problem.
I got interested in the post because
I
read that fuse speed is lesser than
nfs
speed which is counter-intuitive to
my
understanding. So wanted
clarifications.
Now that I got my clarifications
where
fuse outperformed nfs without sync,
we
can resume testing as described
above
and try to find what it is. Based on
your email-id I am guessing you are
from
Boston and I am from Bangalore so if
you
are okay with doing this debugging
for
multiple days because of timezones,
I
will be happy to help. Please be a
bit
patient with me, I am under a
release
crunch but I am very curious with
the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help
me
much, I think we are collecting the
profiles from an active volume, so
it
has a lot of information that is not
pertaining to dd so it is difficult
to
find the contributions of dd. So I
went
through your post again and found
something I didn't pay much
attention
to
earlier i.e. oflag=sync, so did my
own
tests on my setup with FUSE so sent
that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Okay good. At least this
validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file
with
O_SYNC on fuse mount then each
write syscall has to be written
to
disk as part of the syscall
where
as in case of NFS, there is no
concept of open. NFS performs
write
though a handle saying it needs
to
be a synchronous write, so
write()
syscall is performed first then
it
performs fsync(). so an write
on
an
fd with O_SYNC becomes
write+fsync.
I am suspecting that when
multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
PM,
Without the oflag=sync and
only
a single test of each, the
FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Pranith
Could you let me know the
speed without oflag=sync
on
both the mounts? No need
to
collect profiles.
On Wed, May 10, 2017 at
9:17
PM, Pat Haley
Here is what I see
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
off
On 05/10/2017 11:44
AM,
Pranith Kumar
Karampuri
Is this the volume
info
you have?
/[root at
mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users> <http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
info
Distribute />/Volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started
/>/Number
of Bricks: 2
tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
/>/Options
on />/nfs.disable: on
off
/
​I copied this from
old
thread from 2016.
This
is
distribute volume.
Did
you change any of the
options in between?
--
Pranith
Pranith Kumar Karampuri
2017-06-24 05:43:53 UTC
Permalink
On Fri, Jun 23, 2017 at 9:10 AM, Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Pat Haley
Hi,
Today we experimented with some of the FUSE options that we found in the
list.
gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB
This is a good coincidence, I am meeting with write-behind
maintainer(+Raghavendra G) today for the same doubt. I think we will have
something by EOD IST. I will update you.
Sorry, I forgot to update you. It seems like there is a bug in write-behind
and the Facebook guys sent a patch (http://review.gluster.org/16079) to fix it.
But even with that I am not seeing any improvement. Maybe I am doing
something wrong. Will update you if I find anything more.
Post by Pranith Kumar Karampuri
Changing the following option from its default value made the speed slower
Post by Pat Haley
gluster volume set test-volume performance.write-behind off (on by default)
Changing the following options initially appeared to give a 10% increase
in speed, but this vanished in subsequent tests (we think the apparent
increase may have been to a lighter workload on the computer from other
users)
gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4
Can anything be gleaned from these observations? Are there other things
we can try?
Thanks
Pat
Hi Ben,
Sorry this took so long, but we had a real-time forecasting exercise last
week and I could only get to this now.
- Much of the information on our back end system is included at the
top of http://lists.gluster.org/pipermail/gluster-users/2017-April/
030529.html
- The specific model of the hard disks is SeaGate ENTERPRISE CAPACITY
V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
- Note: there is one physical server that hosts both the NFS and the
GlusterFS areas
Latest tests
I have had time to run the tests for one of the dd tests you requested to
the underlying XFS FS. The median rate was 170 MB/s. The dd results and
iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and to the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both what mix? Since we are using DD I assume you are working iwth large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Lets see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
Thanks
Pat
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec(expected for replica 2 with 10G, {~1200 MB /
sec} / #replicas{2} = 600). Gluster does client side replication so with
replica 2 you will only ever see 1/2 the speed of your slowest part of
the
stack(NW, disk, RAM, CPU). This is usually NW or disk and 600 is
normally
a best case. Now in your output I do see the instances where you went
1. You are not using conv=fdatasync and writes are actually going to
page
cache and then being flushed to disk. During the fsync the memory is not
yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNS(like in a SAN)
and when write times are slow the RAID group is busy serviceing other
LUNs.
3. Gluster bug / config issue / some other unknown unknown.
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating
gluster perf is: throughput = NIC throughput or storage(whatever is
slower) / # replicas * overhead(figure .7 or .8). Also the larger the
record size the better for glusterfs mounts, I normally like to be at
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000
conv=fdatasync
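Plugging numbers into that formula (assuming the 10G link from the estimate above, i.e. roughly 1200 MB/s of line rate):

# replica 2 over a 10G link, 0.7 overhead factor
echo '1200 / 2 * 0.7' | bc -l     # ~420 MB/s best case per client
# plain distribute (replica count 1), same link
echo '1200 / 1 * 0.7' | bc -l     # ~840 MB/s best case per client

Since data-volume here is a plain distribute volume (no replicas), the second figure is the relevant ceiling.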
WRT #2 - Again, I question your testing and your storage config. Try
using
conv=fdatasync for your DDs, use a larger record size, and make sure that
your back end storage is not causing your slowdowns. Also remember that
with replica 2 you will take ~50% hit on writes because the client uses
50% of its bandwidth to write to one replica and 50% to the other.
-b
Thanks
Pat
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
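Since conv=sync, conv=fdatasync and oflag=sync keep coming up, a quick side-by-side of what each dd flag actually does (my understanding of the coreutils semantics; /gluster-mount is a placeholder path):

# conv=sync: pads each input block with NULs up to the block size;
#            it does NOT force anything to disk, so cached throughput is reported
dd if=/dev/zero of=/gluster-mount/zeros.txt bs=1M count=4096 conv=sync

# conv=fdatasync: writes normally, then issues a single fdatasync() before dd
#                 exits, so the reported rate includes flushing the page cache
dd if=/dev/zero of=/gluster-mount/zeros.txt bs=1M count=4096 conv=fdatasync

# oflag=sync: opens the output O_SYNC, so every single write must reach stable
#             storage before the next one starts (the worst case for a FUSE mount)
dd if=/dev/zero of=/gluster-mount/zeros.txt bs=1M count=4096 oflag=sync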
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
%-latency  Avg-latency  Min-latency  Max-latency       No. of calls  Fop
96.39      1165.10 us   89.00 us     16487014.00 us    393212        WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in the .glusterfs directory of each brick. The median results (12 dd trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see what the numbers are. Please provide profile numbers for the same. From there on we will start tuning the volume to see what we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive Ben Turner's follow-up to it). So we tried to create a gluster volume under /home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning, I think. Anyways, transport tcp is the default, so there is no need to specify it; just remove those two words from the CLI.
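In other words, something along these lines should get past the parser (a sketch using the same brick paths; gluster may still ask for 'force' if it objects to the brick location):

# transport keyword moved in front of the brick list ...
gluster volume create test-volume transport tcp \
    mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
# ... or dropped entirely, since tcp is the default
gluster volume create test-volume \
    mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
gluster volume start test-volume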
Also do you have a list of the tests we should be running once we get this volume created? Given the time-zone difference it might help if we can run a small battery of tests and post the results rather than test-post-new test-post... .
This is the first time I am doing performance analysis on users as far as I remember. In our team there are separate engineers who do these tests. Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test. All we have is our production hardware.
You said something about the /home partition which has fewer disks; we can create a plain distribute volume inside one of those directories. After we are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The results without oflag=sync were 1.6 Gb/s (faster than gluster but not as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your volume is just distribute. Is there any way you can do tests on similar hardware but at a small scale? Just so we can run the workload to learn more about the bottlenecks in the system? We can probably try to get the speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let me know if that is something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar Karampuri
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also add the dd tests writing to the /home area (no gluster, still on the same machine)
* dd test without oflag=sync (rough average of multiple tests): 570 Mb/s, 390 Mb/s, nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests): 5 Mb/s, 200 Mb/s, nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each brick of the gluster area is a RAID-6 of 32 disks, I would naively expect the writes to the gluster area to be roughly 8x faster than to the non-gluster.
I think a better test is to try and write to a file using nfs without any gluster to a location that is not inside the brick but some other location that is on the same disk(s). If you are mounting the partition as the brick, then we can write to a file inside the .glusterfs directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
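Concretely, on this setup that test might look like the following (a sketch; the temporary file name is arbitrary and should be removed afterwards):

# write straight to the brick filesystem, bypassing gluster entirely
dd if=/dev/zero of=/mnt/brick1/.glusterfs/dd-test-delete-me bs=1M count=4096 conv=fdatasync
rm -f /mnt/brick1/.glusterfs/dd-test-delete-me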
I still think we have a speed issue, I can't tell if fuse vs nfs is part of the problem.
I got interested in the post because I read that fuse speed is lesser than nfs speed which is counter-intuitive to my understanding. So wanted clarifications. Now that I got my clarifications where fuse outperformed nfs without sync, we can resume testing as described above and try to find what it is. Based on your email-id I am guessing you are from Boston and I am from Bangalore so if you are okay with doing this debugging for multiple days because of timezones, I will be happy to help. Please be a bit patient with me, I am under a release crunch but I am very curious with the problem you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting the profiles from an active volume, so it has a lot of information that is not pertaining to dd so it is difficult to find the contributions of dd. So I went through your post again and found something I didn't pay much attention to earlier i.e. oflag=sync, so did my own tests on my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith Kumar Karampuri
Okay good. At least this validates my doubts. Handling O_SYNC in gluster NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write syscall has to be written to disk as part of the syscall, whereas in the case of NFS there is no concept of open. NFS performs the write through a handle saying it needs to be a synchronous write, so the write() syscall is performed first and then it performs fsync(). So a write on an fd with O_SYNC becomes write+fsync. I am suspecting that when multiple threads do this write+fsync() operation on the same file, multiple writes get batched together before being written to disk, which is why the disk throughput increases.
Does it answer your doubts?
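The two I/O patterns contrasted here (every write synchronous vs. plain writes followed by one sync) can also be seen from the client side by tracing dd in its two modes (a sketch; the target file is a throwaway on the FUSE mount):

# O_SYNC open: each 1 MB write() must be stable before the next one starts
strace -f -e trace=open,openat,write,fsync,fdatasync \
    dd if=/dev/zero of=/gluster-mount/synctest bs=1M count=4 oflag=sync

# plain writes plus a single fdatasync() at the end
strace -f -e trace=open,openat,write,fsync,fdatasync \
    dd if=/dev/zero of=/gluster-mount/synctest bs=1M count=4 conv=fdatasync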
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only a single test of each, the FUSE:
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith Kumar Karampuri
Could you let me know the speed without oflag=sync on both the mounts? No need to collect profiles.
On Wed, May 10, 2017 at 9:17 PM, Pat Haley
Here is what I see
gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
nfs.disable: on
On 05/10/2017 11:44 AM, Pranith Kumar Karampuri
Is this the volume info you have?

[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
nfs.disable: on

I copied this from an old thread from 2016. This is a distribute volume. Did you change any of the options in between?
Pat Haley
2017-06-26 14:10:54 UTC
Permalink
Hi All,

Decided to try another tests of gluster mounted via FUSE vs gluster
mounted via NFS, this time using the software we run in production (i.e.
our ocean model writing a netCDF file).

gluster mounted via NFS: the run took 2.3 hr

gluster mounted via FUSE: the run took 44.2 hr

The only problem with using gluster mounted via NFS is that it does not
respect the group write permissions which we need.

We have an exercise coming up in a couple of weeks. It seems to me
that in order to improve our write times before then, it would be good
to solve the group write permissions for gluster mounted via NFS now.
We can then revisit gluster mounted via FUSE afterwards.

What information would you need to help us force gluster mounted via NFS
to respect the group write permissions?

Thanks

Pat
Post by Pranith Kumar Karampuri
On Fri, Jun 23, 2017 at 9:10 AM, Pranith Kumar Karampuri
Hi,
Today we experimented with some of the FUSE options that we
found in the list.
gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB
This is a good coincidence, I am meeting with write-behind
maintainer(+Raghavendra G) today for the same doubt. I think we
will have something by EOD IST. I will update you.
Sorry, forgot to update you. It seems like there is a bug in
Write-behind and Facebook guys sent a patch
http://review.gluster.org/16079 to fix the same. But even with that I
am not seeing any improvement. Maybe I am doing something wrong. Will update you if I find anything more.
Changing the following option from its default value made the speed slower
gluster volume set test-volume performance.write-behind off (on by default)
Changing the following options initially appeared to give a 10% increase in speed, but this vanished in subsequent tests (we think the apparent increase may have been due to a lighter workload on the computer from other users)
gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4
Can anything be gleaned from these observations? Are there
other things we can try?
Thanks
Pat
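One small aside on the tuning experiments quoted above: each option can be put back to its default with 'gluster volume reset' rather than guessing the old value, e.g. (a sketch using the option names tried above):

gluster volume reset test-volume performance.write-behind
gluster volume reset test-volume performance.stat-prefetch
gluster volume reset test-volume client.event-threads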
Post by Pat Haley
Hi Ben,
Sorry this took so long, but we had a real-time forecasting
exercise last week and I could only get to this now.
* Much of the information on our back end system is
included at the top of
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
* The specific model of the hard disks is SeaGate
ENTERPRISE CAPACITY V.4 6TB (ST6000NM0024). The rated
speed is 6Gb/s.
* Note: there is one physical server that hosts both the
NFS and the GlusterFS areas
Latest tests
I have had time to run one of the dd tests you requested against the underlying XFS FS. The median rate was 170 MB/s. The dd results and iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and to the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both, what mix? Since we are using DD I assume you are working with large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Lets see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-06-27 04:47:40 UTC
Permalink
Post by Pat Haley
Hi All,
Decided to try another tests of gluster mounted via FUSE vs gluster
mounted via NFS, this time using the software we run in production (i.e.
our ocean model writing a netCDF file).
gluster mounted via NFS the run took 2.3 hr
gluster mounted via FUSE: the run took 44.2 hr
The only problem with using gluster mounted via NFS is that it does not
respect the group write permissions which we need.
We have an exercise coming up in the a couple of weeks. It seems to me
that in order to improve our write times before then, it would be good to
solve the group write permissions for gluster mounted via NFS now. We can
then revisit gluster mounted via FUSE afterwards.
What information would you need to help us force gluster mounted via NFS
to respect the group write permissions?
+Niels, +Jiffin

I added 2 more guys who work on NFS to check why this problem happens in
your environment. Let's see what information they may need to find the
problem and solve this issue.
Post by Pat Haley
Thanks
Pat
--
Pranith
Soumya Koduri
2017-06-27 06:45:50 UTC
Permalink
Post by Pat Haley
The only problem with using gluster mounted via NFS is that it does not
respect the group write permissions which we need.
We have an exercise coming up in a couple of weeks. It seems to me
that in order to improve our write times before then, it would be good
to solve the group write permissions for gluster mounted via NFS now.
We can then revisit gluster mounted via FUSE afterwards.
What information would you need to help us force gluster mounted via NFS
to respect the group write permissions?
Is it the owning group or one of the auxiliary groups whose write
permissions are not considered? AFAIK, there are no special permission
checks done by the gNFS server compared to the gluster native client.

Could you please provide simple steps to reproduce the issue and collect
pkt trace and nfs/brick logs as well.

Thanks,
Soumya
Pat Haley
2017-06-27 16:29:48 UTC
Permalink
Hi Soumya,

One example, we have a common working directory dri_fleat in the gluster
volume

drwxrwsr-x 22 root dri_fleat 4.0K May 1 15:14 dri_fleat

my user (phaley) does not own that directory but is a member of the
group dri_fleat and should have write permissions. When I go to the
nfs-mounted version and try to use the touch command I get the following

ibfdr-compute-0-4(dri_fleat)% touch dum
touch: cannot touch `dum': Permission denied

One of the sub-directories under dri_fleat is "test" which phaley owns

drwxrwsr-x 2 phaley dri_fleat 4.0K May 1 15:16 test

Under this directory (mounted via nfs) user phaley can write

ibfdr-compute-0-4(test)% touch dum
ibfdr-compute-0-4(test)%
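
For reference, a condensed sketch of the reproduction above (run as user phaley on the client; /gdata-nfs stands in for the client-side NFS mount point, which is not named in the thread):

id phaley                              # confirm dri_fleat appears in the group list
ls -ld /gdata-nfs/dri_fleat            # drwxrwsr-x root dri_fleat
touch /gdata-nfs/dri_fleat/dum         # fails over kNFS with "Permission denied"
touch /gdata-nfs/dri_fleat/test/dum    # succeeds, since phaley owns test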

I have put the packet captures in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/

capture_nfsfail.pcap has the results from the failed touch experiment
capture_nfssucceed.pcap has the results from the successful touch
experiment

The command I used for these was

tcpdump -i ib0 -nnSs 0 host 172.16.1.119 -w /root/capture_nfstest.pcap

The brick log files are also in the above link. If I read them
correctly, they both have odd timestamps. Specifically, I see entries from
around 2017-06-27 14:02:37.404865 even though the system time was
2017-06-27 12:00:00.

One final item: another reply to my post had a link about possible
problems that could arise from users belonging to too many groups. We
have seen the above problem even with a user belonging to only 4 groups.

Let me know what additional information I can provide.

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-06-30 14:26:46 UTC
Permalink
Hi,

I was wondering if there were any additional tests we could perform to
help debug the group write-permissions issue?

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-07-03 11:58:26 UTC
Permalink
Post by Pat Haley
Hi,
I was wondering if there were any additional test we could perform to
help debug the group write-permissions issue?
Sorry for the delay. Please find response inline --
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Soumya,
One example, we have a common working directory dri_fleat in the
gluster volume
drwxrwsr-x 22 root dri_fleat 4.0K May 1 15:14 dri_fleat
my user (phaley) does not own that directory but is a member of the
group dri_fleat and should have write permissions. When I go to the
nfs-mounted version and try to use the touch command I get the following
ibfdr-compute-0-4(dri_fleat)% touch dum
touch: cannot touch `dum': Permission denied
One of the sub-directories under dri_fleat is "test" which phaley owns
drwxrwsr-x 2 phaley dri_fleat 4.0K May 1 15:16 test
Under this directory (mounted via nfs) user phaley can write
ibfdr-compute-0-4(test)% touch dum
ibfdr-compute-0-4(test)%
I have put the packet captures in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/
capture_nfsfail.pcap has the results from the failed touch experiment
capture_nfssucceed.pcap has the results from the successful touch
experiment
The command I used for these was
tcpdump -i ib0 -nnSs 0 host 172.16.1.119 -w /root/capture_nfstest.pcap
I hope these pkts were captured on the node where the NFS server is running.
Could you please use '-i any', as I do not see glusterfs traffic in the
tcpdump.

Also, it looks like NFS v4 is used between the client & the NFS server. Are you
using kernel-NFS here (i.e., kernel-NFS exporting a fuse-mounted gluster
volume)?
If that is the case, please capture the fuse-mnt logs as well. This error may
well be coming from kernel-NFS itself before the request is sent to
the fuse-mnt process.

FWIW, we have the below option -

Option: server.manage-gids
Default Value: off
Description: Resolve groups on the server-side.

I haven't looked into exactly what this option does, but it may be worth
testing with this option on.
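
A minimal sketch of that test, assuming the volume name data-volume from the gluster v info output quoted later in the thread (run on one of the gluster servers):

gluster volume set data-volume server.manage-gids on
gluster volume info data-volume    # the option should now be listed as reconfigured

and then retry the failing touch from the NFS client.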

Thanks,
Soumya
Pat Haley
2017-07-03 15:31:35 UTC
Permalink
Hi Soumya,

When I originally did the tests I ran tcpdump on the client.

I have rerun the tests, doing tcpdump on the server

tcpdump -i any -nnSs 0 host 172.16.1.121 -w /root/capture_nfsfail.pcap

The results are in the same place

http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/

capture_nfsfail.pcap has the results from the failed touch experiment
capture_nfssucceed.pcap has the results from the successful touch
experiment

The brick log files are there too.

I believe we are using kernel-NFS exporting a fuse mounted gluster
volume. I am having Steve confirm this. I tried to find the fuse-mnt
logs but failed. Where should I look for them?

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-07-04 09:01:48 UTC
Permalink
Post by Pat Haley
Hi Soumya,
When I originally did the tests I ran tcpdump on the client.
I have rerun the tests, doing tcpdump on the server
tcpdump -i any -nnSs 0 host 172.16.1.121 -w /root/capture_nfsfail.pcap
The results are in the same place
http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/
capture_nfsfail.pcap has the results from the failed touch experiment
capture_nfssucceed.pcap has the results from the successful touch
experiment
The brick log files are there too.
Thanks for sharing. It looks like the error is not generated on the
gluster-server side. The permission-denied error was caused either by
kNFS or by the fuse-mnt process, or possibly by the combination of the two.

To check the fuse-mnt logs, please look at
/var/log/glusterfs/<fuse_mnt_directory>.log

For example: if you have fuse-mounted the gluster volume at /mnt/fuse-mnt
and exported it via kNFS, the log for that fuse mount will be at
/var/log/glusterfs/mnt-fuse-mnt.log
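
A quick way to locate that log on the server (a sketch; the log file name is the fuse mount point path with '/' replaced by '-'):

mount -t fuse.glusterfs            # lists the gluster fuse mounts and their mount points
ls -l /var/log/glusterfs/*.log     # e.g. a volume fuse-mounted at /gdata logs to gdata.log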


Also, why not switch to either the gluster-NFS native server or NFS-Ganesha
instead of kNFS? They are the recommended NFS servers to use with
gluster.

Thanks,
Soumya
Pat Haley
2017-07-05 15:36:17 UTC
Permalink
Hi Soumya,

(1) In http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/
I've placed the following 2 log files

etc-glusterfs-glusterd.vol.log
gdata.log

The first has repeated messages about nfs disconnects. The second had
the <fuse_mnt_directory>.log name (but not much information).

(2) About the gluster-NFS native server: do you know where we can find
documentation on how to use/install it? We haven't had success in our
searches.

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-07-07 00:46:24 UTC
Permalink
Hi All,

A follow-up question. I've been looking at various pages on nfs-ganesha
& gluster. Is there a version of nfs-ganesha that is recommended for
use with

glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-07-07 17:31:42 UTC
Permalink
Hi,
Post by Pat Haley
Hi All,
A follow-up question. I've been looking at various pages on nfs-ganesha
& gluster. Is there a version of nfs-ganesha that is recommended for
use with
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
For glusterfs 3.7, nfs-ganesha-2.3-* version can be used.

I see the packages built in the CentOS 7 storage SIG [1] but not for CentOS 6 [2].
Requesting Niels to comment.
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Soumya,
(1) In http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/
I've placed the following 2 log files
etc-glusterfs-glusterd.vol.log
gdata.log
The first has repeated messages about nfs disconnects. The second had
the <fuse_mnt_directory>.log name (but not much information).
Hmm, yeah.. weird.. there are not many logs in the fuse mnt log file.
Post by Pat Haley
Post by Pat Haley
(2) About the gluster-NFS native server: do you know where we can
find documentation on how to use/install it? We haven't had success
in our searches.
Till glusterfs-3.7, gluster-NFS (gNFS) gets enabled by default. The only
requirement is that kernel-NFS has to be disabled for gluster-NFS to
come up. Please disable kernel-NFS server and restart glusterd to start
gNFS. In case of any issues with starting gNFS server, please look at
/var/log/glusterfs/nfs.log.

Thanks,
Soumya


[1] https://buildlogs.centos.org/centos/7/storage/x86_64/gluster-3.7/
[2] https://buildlogs.centos.org/centos/6/storage/x86_64/gluster-3.7/
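
A sketch of that switch on a CentOS 6 server, assuming the volume name data-volume and the options shown later in the thread (nfs.disable: on, nfs.export-volumes: off), and assuming nothing else depends on the kernel-NFS exports:

service nfs stop && chkconfig nfs off             # stop kernel-NFS so gNFS can register with rpcbind
gluster volume set data-volume nfs.disable off
gluster volume set data-volume nfs.export-volumes on
service glusterd restart
showmount -e localhost                            # the volume should now be exported by gNFS
less /var/log/glusterfs/nfs.log                   # check here if gNFS does not come up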
Pat Haley
2017-07-14 01:10:17 UTC
Permalink
Hi Soumya,

I just noticed some of the notes at the bottom. In particular

* Till glusterfs-3.7, gluster-NFS (gNFS) gets enabled by default. The
only requirement is that kernel-NFS has to be disabled for
gluster-NFS to come up. Please disable kernel-NFS server and restart
glusterd to start gNFS. In case of any issues with starting gNFS
server, please look at /var/log/glusterfs/nfs.log.

If we disable the kernel-NFS on our server and restart glusterd to start
gNFS will that affect the NFS file system also being served by that
server (i.e. the single server serves both a glusterFS area and an NFS
area)? Would we also have to disable the kernel-NFS for NFS-ganesha?

My second question concerns NFS-ganesha (v 2.3.x) for CentOS 6.8 and
gluster 3.7.11. I think I see a couple of possibilities

1. I see one possible rpm for version 2.3.3 in
https://mirror.chpc.utah.edu/pub/vault.centos.org/centos/6.8/storage/Source/gluster-3.8/
The other rpms seem to be for gluster 3.8 packages, so I'm
wondering if there is a concern about conflicts
2. In one of the links you sent
(https://buildlogs.centos.org/centos/6/storage/x86_64/gluster-3.7/)
I see an rpm for glusterfs-ganesha-3.7.11 . Is this a specific
gluster package for compatibility with ganesha or a ganesha package
for gluster?

Does either possibility seem more likely to be what I need than the other?

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-07-14 05:04:13 UTC
Permalink
Post by Pat Haley
Hi Soumya,
I just noticed some of the notes at the bottom. In particular
* Till glusterfs-3.7, gluster-NFS (gNFS) gets enabled by default. The
only requirement is that kernel-NFS has to be disabled for
gluster-NFS to come up. Please disable kernel-NFS server and restart
glusterd to start gNFS. In case of any issues with starting gNFS
server, please look at /var/log/glusterfs/nfs.log.
If we disable the kernel-NFS on our server and restart glusterd to start
gNFS will that affect the NFS file system also being served by that
server (i.e. the single server serves both a glusterFS area and an NFS
area)?
That's right. When you restart glusterd, it tries to spawn (provided
the nfs.disable option is set to off for any volume) a new glusterfs client
process which acts as the NFS server as well.

Would we also have to disable the kernel-NFS for NFS-ganesha?

yes.
Post by Pat Haley
My second question concerns NFS-ganesha (v 2.3.x) for CentOS 6.8 and
gluster 3.7.11. I think I see a couple of possibilities
1. I see one possible rpm for version 2.3.3 in
https://mirror.chpc.utah.edu/pub/vault.centos.org/centos/6.8/storage/Source/gluster-3.8/
The other rpm's seem to be for gluster 3.8 packages, so I'm
wondering if there is a concern for conflict
AFAIK, nfs-ganesha-2.3.3 should work with both 3.8 & 3.7 gluster.
Post by Pat Haley
2. In one of the links you sent
(https://buildlogs.centos.org/centos/6/storage/x86_64/gluster-3.7/)
I see an rpm for glusterfs-ganesha-3.7.11 . Is this a specific
gluster package for compatibility with ganesha or a ganesha package
for gluster?
This is to be compatible with gluster-3.7* package.
Post by Pat Haley
Does either possibility seem more likely to be what I need than the other?
The current stable/maintained/tested combination is nfs-ganesha-2.4/2.5 +
glusterfs-3.8/3.10. However, in case you cannot upgrade, you can still
use nfs-ganesha-2.3* with glusterfs-3.8/3.7.

Hope it is clear.

Thanks,
Soumya
Niels de Vos
2017-06-27 08:13:17 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Pat Haley
Hi All,
Decided to try another tests of gluster mounted via FUSE vs gluster
mounted via NFS, this time using the software we run in production (i.e.
our ocean model writing a netCDF file).
gluster mounted via NFS the run took 2.3 hr
gluster mounted via FUSE: the run took 44.2 hr
The only problem with using gluster mounted via NFS is that it does not
respect the group write permissions which we need.
We have an exercise coming up in a couple of weeks. It seems to me
that in order to improve our write times before then, it would be good to
solve the group write permissions for gluster mounted via NFS now. We can
then revisit gluster mounted via FUSE afterwards.
What information would you need to help us force gluster mounted via NFS
to respect the group write permissions?
+Niels, +Jiffin
I added 2 more guys who work on NFS to check why this problem happens in
your environment. Let's see what information they may need to find the
problem and solve this issue.
Hi Pat,

Depending on the number of groups that a user is part of, you may need
to change some volume options. A complete description of the limitations
on the number of groups can be found here:

https://github.com/gluster/glusterdocs/blob/master/Administrator%20Guide/Handling-of-users-with-many-groups.md

HTH,
Niels
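
The options that document discusses can be enabled per volume; a sketch, assuming the volume name data-volume (verify the exact option names against gluster volume set help on 3.7):

gluster volume set data-volume server.manage-gids on     # bricks resolve the full group list server-side
gluster volume set data-volume nfs.server-aux-gids on    # same resolution for the gNFS server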
Post by Pranith Kumar Karampuri
Post by Pat Haley
Thanks
Pat
On Fri, Jun 23, 2017 at 9:10 AM, Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Pat Haley
Hi,
Today we experimented with some of the FUSE options that we found in the
list.
gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB
This is a good coincidence, I am meeting with write-behind
maintainer(+Raghavendra G) today for the same doubt. I think we will have
something by EOD IST. I will update you.
Sorry, forgot to update you. It seems like there is a bug in Write-behind
and Facebook guys sent a patch http://review.gluster.org/16079 to fix the
same. But even with that I am not seeing any improvement. May be I am doing
something wrong. Will update you if I find anything more.
Post by Pranith Kumar Karampuri
Changing the following option from its default value made the speed slower
Post by Pat Haley
gluster volume set test-volume performance.write-behind off (on by default)
Changing the following options initially appeared to give a 10% increase
in speed, but this vanished in subsequent tests (we think the apparent
increase may have been due to a lighter workload on the computer from other
users)
gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4
Can anything be gleaned from these observations? Are there other things
we can try?
Thanks
Pat
Hi Ben,
Sorry this took so long, but we had a real-time forecasting exercise
last week and I could only get to this now.
- Much of the information on our back end system is included at the top of
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
- The specific model of the hard disks is SeaGate ENTERPRISE
CAPACITY V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
- Note: there is one physical server that hosts both the NFS and the
GlusterFS areas
Latest tests
I have had time to run one of the dd tests you requested against the
underlying XFS FS. The median rate was 170 MB/s. The dd results
and iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and for the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC (whichever is slower) * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both, what mix? Since we are using DD I assume you are working with large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Let's see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
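Putting Ben's sequence together, a hedged sketch of one write/read round (the mount point and file name are placeholders; iostat runs in the background for the duration of the test):

iostat -c -m -x 1 > iostat-$(hostname).txt &
IOSTAT_PID=$!
dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
echo 3 > /proc/sys/vm/drop_caches
dd if=/xfs-mount/file of=/dev/null bs=1024k count=10000
kill $IOSTAT_PID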
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
Thanks
Pat
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times with either gluster or nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec (expected for replica 2 with 10G, {~1200 MB / sec} / #replicas{2} = 600). Gluster does client side replication so with replica 2 you will only ever see 1/2 the speed of your slowest part of the stack (NW, disk, RAM, CPU). This is usually NW or disk and 600 is normally a best case. Now in your output I do see instances where you deviated from this; a few possible explanations:
1. You are not using conv=fdatasync and writes are actually going to page cache and then being flushed to disk. During the fsync the memory is not yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNs (like in a SAN) and when write times are slow the RAID group is busy servicing other LUNs.
3. Gluster bug / config issue / some other unknown unknown.
The two issues you report:
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating gluster perf is: throughput = NIC throughput or storage (whatever is slower) / # replicas * overhead (figure .7 or .8). Also the larger the record size the better for glusterfs mounts, I normally like to be at 1M:
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
WRT #2 - Again, I question your testing and your storage config. Try using conv=fdatasync for your DDs, use a larger record size, and make sure that your back end storage is not causing your slowdowns. Also remember that with replica 2 you will take ~50% hit on writes because the client uses 50% of its bandwidth to write to one replica and 50% to the other.
-b
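To make that arithmetic explicit with this thread's numbers (an illustration only): a 10G NIC is roughly 1200 MB/s, so a plain distribute volume (replica count 1) estimates to about 1200 * .6-.7 = 720-840 MB/s best case, while a replica 2 volume halves that to ~600 MB/s before the same overhead factor is applied.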
Thanks
Pat
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us*
393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results
(12
dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt> <http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive Ben Turner's follow-up to your reply). So we tried to create a gluster volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways, transport tcp is the default, so there is no need to specify it; just remove those two words from the CLI.
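For reference, a minimal sketch of the corrected command with those two words removed (hostname and brick paths as in the attempt above):

gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
gluster volume start test-volume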
Also do you have a list of the tests we should be running once we get this volume created? Given the time-zone difference it might help if we can run a small battery of tests and post the results rather than test-post-new-test-post... .
This is the first time I am doing performance analysis with users as far as I remember. In our team there are separate engineers who do these tests. Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On 05/11/2017 12:06 PM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test. All we have is our production hardware.
You said something about the /home partition which has fewer disks; we can create a plain distribute volume inside one of those directories. After we are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The results without oflag=sync were 1.6 Gb/s (faster than gluster but not as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your volume is just distribute. Is there any way you can do tests on similar hardware but at a small scale? Just so we can run the workload to learn more about the bottlenecks in the system? We can probably try to get the speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let me know if that is something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar Karampuri
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also add the dd tests writing to the /home area (no gluster, still on the same machine)
* dd test without oflag=sync (rough average of multiple tests)
  o gluster w/ fuse mount: 570 Mb/s
  o gluster w/ nfs mount: 390 Mb/s
  o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
  o gluster w/ fuse mount: 5 Mb/s
  o gluster w/ nfs mount: 200 Mb/s
  o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each brick of the gluster area is a RAID-6 of 32 disks, I would naively expect the writes to the gluster area to be roughly 8x faster than to the non-gluster.
I think a better test is to try and write to a file using nfs without any gluster to a location that is not inside the brick but some other location that is on the same disk(s). If you are mounting the partition as the brick, then we can write to a file inside the .glusterfs directory, something like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't tell if fuse vs nfs is part of the problem.
I got interested in the post because I read that fuse speed is lesser than nfs speed which is counter-intuitive to my understanding. So wanted clarifications. Now that I got my clarifications where fuse outperformed nfs without sync, we can resume testing as described above and try to find what it is. Based on your email-id I am guessing you are from Boston and I am from Bangalore so if you are okay with doing this debugging for multiple days because of timezones, I will be happy to help. Please be a bit patient with me, I am under a release crunch but I am very curious about the problem you posted.
Was there anything useful in the profiles?
Unfortunately the profiles didn't help me much. I think we are collecting the profiles from an active volume, so it has a lot of information that is not pertaining to dd, so it is difficult to find the contributions of dd. So I went through your post again and found something I didn't pay much attention to earlier, i.e. oflag=sync, so I did my own tests on my setup with FUSE and sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith Kumar Karampuri
Okay good. At least this validates my doubts. Handling O_SYNC in gluster NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write syscall has to be written to disk as part of the syscall, whereas in the case of NFS there is no concept of open. NFS performs the write through a handle saying it needs to be a synchronous write, so the write() syscall is performed first and then it performs fsync(); so a write on an fd with O_SYNC becomes write+fsync. I am suspecting that when multiple threads do this write+fsync() operation on the same file, multiple writes are batched together to be written to disk, so the throughput on the disk increases is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM, Pat Haley
Without the oflag=sync and only a single test of each, the
FUSE
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith Kumar Karampuri
Could you let me know the speed without oflag=sync on both the mounts? No need to collect profiles.
On Wed, May 10, 2017 at 9:17 PM, Pat Haley
Here is what I see:
gluster volume info

Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

On 05/10/2017 11:44 AM, Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from an old thread from 2016. This is a distribute volume. Did you change any of the options in between?
--
Pranith
Joe Julian
2017-05-16 16:03:17 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd
test writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>. The results
without oflag=sync were 1.6 Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer
disks).
Pat
Is that true for every disk? If you're choosing the same filename every
time for your dd test, you're likely only doing that test against one
disk. If that disk is slow, you would get the same results every time
despite other disks performing normally.
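One hedged way to check this (mount point and brick paths as used earlier in this thread; the ddtest_* names are placeholders): repeat the dd test with several different file names so DHT hashes them to different bricks, then look at which brick each file actually landed on.

for i in 1 2 3 4; do dd if=/dev/zero of=/gdata/ddtest_$i bs=1M count=1024 conv=fdatasync; done
ls -l /mnt/brick1/ddtest_* /mnt/brick2/ddtest_* 2>/dev/null
rm -f /gdata/ddtest_*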
Post by Pat Haley
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run
your answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we
also add the dd tests writing to the /home area (no gluster,
still on the same machine)
* dd test without oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would
naively expect the writes to the gluster area to be roughly 8x
faster than to the non-gluster.
I think a better test is to try and write to a file using nfs without
any gluster to a location that is not inside the brick but some other
location that is on same disk(s). If you are mounting the partition
as the brick, then we can write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't tell if fuse vs nfs
is part of the problem.
I got interested in the post because I read that fuse speed is lesser
than nfs speed which is counter-intuitive to my understanding. So
wanted clarifications. Now that I got my clarifications where fuse
outperformed nfs without sync, we can resume testing as described
above and try to find what it is. Based on your email-id I am
guessing you are from Boston and I am from Bangalore so if you are
okay with doing this debugging for multiple days because of
timezones, I will be happy to help. Please be a bit patient with me,
I am under a release crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting
the profiles from an active volume, so it has a lot of information
that is not pertaining to dd so it is difficult to find the
contributions of dd. So I went through your post again and found
something I didn't pay much attention to earlier i.e. oflag=sync, so
did my own tests on my setup with FUSE so sent that reply.
Pat
Post by Pranith Kumar Karampuri
Okay good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write syscall has to be written to disk as part of the syscall, whereas in the case of NFS there is no concept of open. NFS performs the write through a handle saying it needs to be a synchronous write, so the write() syscall is performed first and then it performs fsync(); so a write on an fd with O_SYNC becomes write+fsync. I am suspecting that when multiple threads do this write+fsync() operation on the same file, multiple writes are batched together to be written to disk, so the throughput on the disk increases is my guess.
Does it answer your doubts?
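To see that difference from the command line, a hedged sketch with dd (file names are placeholders): oflag=sync makes every write synchronous, much like an O_SYNC open, while conv=fdatasync writes through the cache and flushes once at the end, which is usually the fairer throughput comparison.

dd if=/dev/zero of=/gdata/sync_test.tmp bs=1M count=1024 oflag=sync
dd if=/dev/zero of=/gdata/fdatasync_test.tmp bs=1M count=1024 conv=fdatasync
rm -f /gdata/sync_test.tmp /gdata/fdatasync_test.tmp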
Without the oflag=sync and only a single test of each, the
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Post by Pranith Kumar Karampuri
Could you let me know the speed without oflag=sync on both
the mounts? No need to collect profiles.
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Post by Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from an old thread from 2016. This is a distribute volume. Did you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pat Haley
2017-05-03 04:50:01 UTC
Permalink
Hi Pranith & Ravi,

Sorry for the delay. I have the profile info for the past couple of
days just below. Is this of any help to you or is there additional
information I can request?

Brick: mseas-data2:/mnt/brick2
------------------------------
Cumulative Stats:
Block Size: 1b+ 2b+ 4b+
No. of Reads: 6 38 1144
No. of Writes: 108032195 8352125 141319922

Block Size: 8b+ 16b+ 32b+
No. of Reads: 689 1256 2756
No. of Writes: 13946933 20694915 57845473

Block Size: 64b+ 128b+ 256b+
No. of Reads: 5522 56492 149462
No. of Writes: 714398165 11923303 2537176

Block Size: 512b+ 1024b+ 2048b+
No. of Reads: 64285 192872 200488
No. of Writes: 5975842 217173849 94536339

Block Size: 4096b+ 8192b+ 16384b+
No. of Reads: 300021 764297 1613672
No. of Writes: 112481858 53164978 330177486

Block Size: 32768b+ 65536b+ 131072b+
No. of Reads: 5101884 14470916 4958306977
No. of Writes: 35098110 19969017 2243344759

Block Size: 262144b+
No. of Reads: 0
No. of Writes: 547
%-latency Avg-latency Min-Latency Max-Latency No. of
calls Fop
--------- ----------- ----------- -----------
------------ ----
0.00 0.00 us 0.00 us 0.00 us 4052087 FORGET
0.00 0.00 us 0.00 us 0.00 us 6381234 RELEASE
0.00 0.00 us 0.00 us 0.00 us 28716633 RELEASEDIR
0.00 92.81 us 48.00 us 130.00 us 53
READLINK
0.00 201.22 us 112.00 us 457.00 us
188 RMDIR
0.00 169.36 us 53.00 us 20417.00 us 347
SETXATTR
0.00 20497.89 us 241.00 us 57505.00 us 45
SYMLINK
0.00 116.97 us 42.00 us 39168.00 us 9172
SETATTR
0.00 380.06 us 76.00 us 198427.00 us
3133 LINK
0.00 149.60 us 14.00 us 601941.00 us 14426
INODELK
0.00 387.81 us 69.00 us 161114.00 us
6617 RENAME
0.01 96.47 us 14.00 us 1224734.00 us
63599 STATFS
0.01 25041.48 us 299.00 us 93211.00 us
348 MKDIR
0.01 380.41 us 31.00 us 561724.00 us
31452 OPEN
0.02 1346.42 us 64.00 us 226741.00 us
18306 UNLINK
0.02 2123.19 us 42.00 us 802398.00 us 12370
FTRUNCATE
0.04 12161.88 us 175.00 us 158072.00 us
3244 MKNOD
0.07 132801.87 us 39.00 us 3144448.00 us
532 FSYNC
0.13 89.98 us 4.00 us 5550246.00 us 1492793 FLUSH
0.45 65.89 us 6.00 us 3608035.00 us 7194229 FSTAT
0.57 14538.33 us 162.00 us 4577282.00 us
41466 CREATE
0.70 3183.52 us 16.00 us 4358324.00 us 231728
OPENDIR
1.67 7559.32 us 8.00 us 4193443.00 us
234012 STAT
2.26 119.27 us 11.00 us 4491219.00 us 20093638 WRITE
2.51 207.00 us 10.00 us 4993074.00 us 12884466 READ
4.17 246.12 us 13.00 us 8857354.00 us 17952607 GETXATTR
23.72 48775.51 us 14.00 us 5022445.00 us 515770
READDIRP
63.65 1238.53 us 25.00 us 4483760.00 us 54507520 LOOKUP

Duration: 9810315 seconds
Data Read: 651660783328883 bytes
Data Written: 305412177327433 bytes

Interval 0 Stats:
Block Size: 1b+ 2b+ 4b+
No. of Reads: 6 38 1144
No. of Writes: 108032195 8352125 141319922

Block Size: 8b+ 16b+ 32b+
No. of Reads: 689 1256 2756
No. of Writes: 13946933 20694915 57845473

Block Size: 64b+ 128b+ 256b+
No. of Reads: 5522 56492 149462
No. of Writes: 714398165 11923303 2537176

Block Size: 512b+ 1024b+ 2048b+
No. of Reads: 64285 192872 200488
No. of Writes: 5975842 217173849 94536339

Block Size: 4096b+ 8192b+ 16384b+
No. of Reads: 300021 764297 1613672
No. of Writes: 112481858 53164978 330177486

Block Size: 32768b+ 65536b+ 131072b+
No. of Reads: 5101884 14470916 4958306977
No. of Writes: 35098110 19969017 2243344759

Block Size: 262144b+
No. of Reads: 0
No. of Writes: 547
%-latency Avg-latency Min-Latency Max-Latency No. of
calls Fop
--------- ----------- ----------- -----------
------------ ----
0.00 0.00 us 0.00 us 0.00 us 4052087 FORGET
0.00 0.00 us 0.00 us 0.00 us 6381233 RELEASE
0.00 0.00 us 0.00 us 0.00 us 28716630 RELEASEDIR
0.00 92.81 us 48.00 us 130.00 us 53
READLINK
0.00 201.22 us 112.00 us 457.00 us
188 RMDIR
0.00 169.36 us 53.00 us 20417.00 us 347
SETXATTR
0.00 20497.89 us 241.00 us 57505.00 us 45
SYMLINK
0.00 116.97 us 42.00 us 39168.00 us 9172
SETATTR
0.00 380.06 us 76.00 us 198427.00 us
3133 LINK
0.00 149.60 us 14.00 us 601941.00 us 14426
INODELK
0.00 387.81 us 69.00 us 161114.00 us
6617 RENAME
0.01 96.47 us 14.00 us 1224734.00 us
63599 STATFS
0.01 25041.48 us 299.00 us 93211.00 us
348 MKDIR
0.01 380.41 us 31.00 us 561724.00 us
31452 OPEN
0.02 1346.42 us 64.00 us 226741.00 us
18306 UNLINK
0.02 2123.19 us 42.00 us 802398.00 us 12370
FTRUNCATE
0.04 12161.88 us 175.00 us 158072.00 us
3244 MKNOD
0.07 132801.87 us 39.00 us 3144448.00 us
532 FSYNC
0.13 89.98 us 4.00 us 5550246.00 us 1492793 FLUSH
0.45 65.89 us 6.00 us 3608035.00 us 7194229 FSTAT
0.57 14538.33 us 162.00 us 4577282.00 us
41466 CREATE
0.70 3183.52 us 16.00 us 4358324.00 us 231728
OPENDIR
1.67 7559.32 us 8.00 us 4193443.00 us
234012 STAT
2.26 119.27 us 11.00 us 4491219.00 us 20093638 WRITE
2.51 207.00 us 10.00 us 4993074.00 us 12884466 READ
4.17 246.12 us 13.00 us 8857354.00 us 17952607 GETXATTR
23.72 48775.51 us 14.00 us 5022445.00 us 515770
READDIRP
63.65 1238.53 us 25.00 us 4483760.00 us 54507520 LOOKUP

Duration: 9810315 seconds
Data Read: 651660783328883 bytes
Data Written: 305412177327433 bytes

Brick: mseas-data2:/mnt/brick1
------------------------------
Cumulative Stats:
Block Size: 1b+ 2b+ 4b+
No. of Reads: 4 38 1482
No. of Writes: 643631512 59055444 235532859

Block Size: 8b+ 16b+ 32b+
No. of Reads: 1171 2138 4748
No. of Writes: 31816870 23602175 50161322

Block Size: 64b+ 128b+ 256b+
No. of Reads: 9461 65360 165954
No. of Writes: 711114605 11760241 4078907

Block Size: 512b+ 1024b+ 2048b+
No. of Reads: 94563 226053 258803
No. of Writes: 6366990 211643393 95831137

Block Size: 4096b+ 8192b+ 16384b+
No. of Reads: 383871 1032345 2244921
No. of Writes: 155833532 57850303 339892660

Block Size: 32768b+ 65536b+ 131072b+
No. of Reads: 7588068 22368398 5387488199
No. of Writes: 38588368 25195605 2463004132

Block Size: 262144b+
No. of Reads: 0
No. of Writes: 489
%-latency Avg-latency Min-Latency Max-Latency No. of
calls Fop
--------- ----------- ----------- -----------
------------ ----
0.00 0.00 us 0.00 us 0.00 us 4060396 FORGET
0.00 0.00 us 0.00 us 0.00 us 6244016 RELEASE
0.00 0.00 us 0.00 us 0.00 us 28716852 RELEASEDIR
0.00 96.42 us 61.00 us 148.00 us 40
READLINK
0.00 208.36 us 114.00 us 322.00 us
188 RMDIR
0.00 2231.61 us 57.00 us 716342.00 us 347
SETXATTR
0.00 20821.92 us 758.00 us 57852.00 us 38
SYMLINK
0.00 519.11 us 76.00 us 952378.00 us
3149 LINK
0.00 196.97 us 50.00 us 736928.00 us 9055
SETATTR
0.00 164.34 us 18.00 us 736161.00 us 13460
INODELK
0.00 375.54 us 73.00 us 198362.00 us
6274 RENAME
0.01 20913.10 us 351.00 us 102696.00 us
348 MKDIR
0.01 151.39 us 17.00 us 782025.00 us
63598 STATFS
0.03 1103.67 us 34.00 us 618187.00 us
29597 OPEN
0.03 2833.17 us 43.00 us 1069257.00 us 11693
FTRUNCATE
0.04 2267.87 us 61.00 us 3746134.00 us
17859 UNLINK
0.04 13105.16 us 254.00 us 179505.00 us
3177 MKNOD
0.05 88496.76 us 21.00 us 1718559.00 us
613 FSYNC
0.58 73.42 us 6.00 us 1917794.00 us 7848483 FSTAT
0.71 17177.23 us 177.00 us 7077794.00 us
40554 CREATE
0.79 585.79 us 3.00 us 11107703.00 us
1322036 FLUSH
1.72 7459.40 us 9.00 us 2764285.00 us
228033 STAT
1.96 8350.73 us 19.00 us 2235725.00 us 231728
OPENDIR
2.60 115.35 us 12.00 us 4196355.00 us 22239110 WRITE
4.60 313.20 us 10.00 us 6211594.00 us 14494253 READ
5.98 307.95 us 13.00 us 9885480.00 us 19163193 GETXATTR
25.68 48514.34 us 17.00 us 4734636.00 us 522162
READDIRP
55.15 1075.93 us 26.00 us 4291535.00 us 50562855 LOOKUP

Duration: 9810315 seconds
Data Read: 708869551853133 bytes
Data Written: 335305857076797 bytes

Interval 0 Stats:
Block Size: 1b+ 2b+ 4b+
No. of Reads: 4 38 1482
No. of Writes: 643631512 59055444 235532859

Block Size: 8b+ 16b+ 32b+
No. of Reads: 1171 2138 4748
No. of Writes: 31816870 23602175 50161322

Block Size: 64b+ 128b+ 256b+
No. of Reads: 9461 65360 165954
No. of Writes: 711114605 11760241 4078907

Block Size: 512b+ 1024b+ 2048b+
No. of Reads: 94563 226053 258803
No. of Writes: 6366990 211643393 95831137

Block Size: 4096b+ 8192b+ 16384b+
No. of Reads: 383871 1032345 2244921
No. of Writes: 155833532 57850303 339892660

Block Size: 32768b+ 65536b+ 131072b+
No. of Reads: 7588068 22368398 5387488199
No. of Writes: 38588368 25195605 2463004132

Block Size: 262144b+
No. of Reads: 0
No. of Writes: 489
%-latency Avg-latency Min-Latency Max-Latency No. of
calls Fop
--------- ----------- ----------- -----------
------------ ----
0.00 0.00 us 0.00 us 0.00 us 4060397 FORGET
0.00 0.00 us 0.00 us 0.00 us 6244015 RELEASE
0.00 0.00 us 0.00 us 0.00 us 28716850 RELEASEDIR
0.00 96.42 us 61.00 us 148.00 us 40
READLINK
0.00 208.36 us 114.00 us 322.00 us
188 RMDIR
0.00 2231.61 us 57.00 us 716342.00 us 347
SETXATTR
0.00 20821.92 us 758.00 us 57852.00 us 38
SYMLINK
0.00 519.11 us 76.00 us 952378.00 us
3149 LINK
0.00 196.97 us 50.00 us 736928.00 us 9055
SETATTR
0.00 164.34 us 18.00 us 736161.00 us 13460
INODELK
0.00 375.54 us 73.00 us 198362.00 us
6274 RENAME
0.01 20913.10 us 351.00 us 102696.00 us
348 MKDIR
0.01 151.39 us 17.00 us 782025.00 us
63598 STATFS
0.03 1103.67 us 34.00 us 618187.00 us
29597 OPEN
0.03 2833.17 us 43.00 us 1069257.00 us 11693
FTRUNCATE
0.04 2267.87 us 61.00 us 3746134.00 us
17859 UNLINK
0.04 13105.16 us 254.00 us 179505.00 us
3177 MKNOD
0.05 88496.76 us 21.00 us 1718559.00 us
613 FSYNC
0.58 73.42 us 6.00 us 1917794.00 us 7848483 FSTAT
0.71 17177.23 us 177.00 us 7077794.00 us
40554 CREATE
0.79 585.79 us 3.00 us 11107703.00 us
1322036 FLUSH
1.72 7459.40 us 9.00 us 2764285.00 us
228033 STAT
1.96 8350.73 us 19.00 us 2235725.00 us 231728
OPENDIR
2.60 115.35 us 12.00 us 4196355.00 us 22239110 WRITE
4.60 313.20 us 10.00 us 6211594.00 us 14494253 READ
5.98 307.95 us 13.00 us 9885480.00 us 19163193 GETXATTR
25.68 48514.34 us 17.00 us 4734636.00 us 522162
READDIRP
55.15 1075.93 us 26.00 us 4291535.00 us 50562855 LOOKUP

Duration: 9810315 seconds
Data Read: 708869551853133 bytes
Data Written: 335305857076797 bytes
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single
point of failure. Unlike fuse mounts, if the gluster node
containing the gnfs server goes down, all mounts done using that
node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs
mounts, you can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than
gNFS servers?
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
Thanks,
Ravi
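For completeness, a hedged sketch of the profiling commands that guide describes (volume name from this thread; run the workload of interest between start and info):

gluster volume profile data-volume start
gluster volume profile data-volume info
gluster volume profile data-volume stop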
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk
when compared to writing to an NFS disk. Specifically when using
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluster disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3
3108 [Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Joe Julian
2017-05-16 16:08:12 UTC
Permalink
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single
point of failure. Unlike fuse mounts, if the gluster node
containing the gnfs server goes down, all mounts done using that
node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs
mounts, you can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than
gNFS servers?
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
I have done actual testing. For directory ops, NFS is faster due to the
default cache settings in the kernel. For raw throughput, or ops on an
open file, fuse is faster.

I have yet to test this but I expect with the newer caching features in
3.8+, even directory op performance should be similar to nfs and more
accurate.
Post by Ravishankar N
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
Thanks,
Ravi
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pranith Kumar Karampuri
2017-05-17 09:02:04 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts but you get
the benefit of avoiding a single point of failure. Unlike fuse mounts, if
the gluster node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts, you
can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than gNFS
servers?
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
I have done actual testing. For directory ops, NFS is faster due to the
default cache settings in the kernel. For raw throughput, or ops on an open
file, fuse is faster.
I have yet to test this but I expect with the newer caching features in
3.8+, even directory op performance should be similar to nfs and more
accurate.
We are actually comparing fuse+gluster and kernel NFS on the same brick.
Did you get a chance to do this test at any point?
Post by Pranith Kumar Karampuri
You can follow https://gluster.readthedocs.io/en/latest/Administrator%
20Guide/Monitoring%20Workload/ to get the information.
Post by Ravishankar N
Thanks,
Ravi
--
Pranith
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Pranith
Joe Julian
2017-05-17 16:24:38 UTC
Permalink
Post by Joe Julian
On Sat, Apr 8, 2017 at 10:28 AM, Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and
then see if there is an improvement in speed. Fuse mounts are
slower than gnfs mounts but you get the benefit of avoiding a
single point of failure. Unlike fuse mounts, if the gluster
node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try
tweaking the write-behind xlator settings to see if it helps.
See the performance.write-behind and
performance.write-behind-window-size options in `gluster
volume set help`. Of course, even for gnfs mounts, you can
achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower
than gNFS servers?
Pat,
I see that I am late to the thread, but do you happen to
have "profile info" of the workload?
I have done actual testing. For directory ops, NFS is faster due
to the default cache settings in the kernel. For raw throughput,
or ops on an open file, fuse is faster.
I have yet to test this but I expect with the newer caching
features in 3.8+, even directory op performance should be similar
to nfs and more accurate.
We are actually comparing fuse+gluster and kernel NFS on the same
brick. Did you get a chance to do this test at any point?
No, that's not comparing like to like and I've rarely had a use case to
which a single-store NFS was the answer.
Post by Joe Julian
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
<https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/>
to get the information.
Thanks,
Ravi
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pranith Kumar Karampuri
2017-05-17 18:55:27 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts but you get
the benefit of avoiding a single point of failure. Unlike fuse mounts, if
the gluster node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts, you
can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than gNFS
servers?
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
I have done actual testing. For directory ops, NFS is faster due to the
default cache settings in the kernel. For raw throughput, or ops on an open
file, fuse is faster.
I have yet to test this but I expect with the newer caching features in
3.8+, even directory op performance should be similar to nfs and more
accurate.
We are actually comparing fuse+gluster and kernel NFS on the same brick.
Did you get a chance to do this test at any point?
No, that's not comparing like to like and I've rarely had a use case to
which a single-store NFS was the answer.
Exactly. Why is it so bad compared to kNFS? Whether there is any scope for improvement is the question we are trying to find an answer to. If there is, everyone wins :-)

PS: I may not respond till tomorrow. Will go to sleep now.
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
You can follow https://gluster.readthedocs.io
/en/latest/Administrator%20Guide/Monitoring%20Workload/ to get the
information.
Post by Ravishankar N
Thanks,
Ravi
--
Pranith
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users