Discussion:
[Gluster-users] Slow write times to gluster disk
Pat Haley
2017-04-07 18:37:06 UTC
Hi,

We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
duplicator) to write a 4.3 GB file of zeros:

* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s

The gluster disk is 2 bricks joined together, no replication or anything
else. The hardware is (literally) the same:

* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spares

Some additional information and more tests results (after changing the
log level):

glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)



*Create the file to /gdata (gluster)*
[***@mseas-data2 gdata]# dd if=/dev/zero of=/gdata/zero1 bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*

*Create the file to /home (ext4)*
[***@mseas-data2 gdata]# dd if=/dev/zero of=/home/zero1 bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s* - 3 times as fast


*Copy from /gdata to /gdata (gluster to gluster)*
[***@mseas-data2 gdata]# dd if=/gdata/zero1 of=/gdata/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww


*Copy from /gdata to /gdata, 2nd time (gluster to gluster)*
[***@mseas-data2 gdata]# dd if=/gdata/zero1 of=/gdata/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again



*Copy from /home to /home (ext4 to ext4)*
[***@mseas-data2 gdata]# dd if=/home/zero1 of=/home/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s* - 30 times as fast


*Copy from /home to /home (ext4 to ext4)*
[***@mseas-data2 gdata]# dd if=/home/zero1 of=/home/zero3
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast


As a test, can we copy data directly to the xfs mountpoint (/mnt/brick1)
and bypass gluster?
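If so, a rough sketch of what we had in mind is something like the
following (the file name is just a placeholder, and we would only write
throwaway test files directly into the brick):

[***@mseas-data2 ~]# dd if=/dev/zero of=/mnt/brick1/ddtest.zero bs=1M count=1000
[***@mseas-data2 ~]# rm -f /mnt/brick1/ddtest.zero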


Any help you could give us would be appreciated.

Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ravishankar N
2017-04-08 04:58:49 UTC
Hi Pat,

I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts, but you
get the benefit of avoiding a single point of failure: unlike fuse
mounts, if the gluster node containing the gnfs server goes down, all
mounts done using that node will fail. For fuse mounts, you could try
tweaking the write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts,
you can achieve fail-over by using CTDB.
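For example, something along these lines (the volume name is a
placeholder, and the window size is just a starting point you would have
to experiment with):

# gluster volume set <volname> performance.write-behind on
# gluster volume set <volname> performance.write-behind-window-size 4MB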

Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing the
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pat Haley
2017-04-10 19:12:45 UTC
Hi Ravi,

Thanks for the reply. And yes, we are using the gluster native (fuse)
mount. Since this is not my area of expertise, I have a few questions
(mostly clarifications).

Is a factor of 20 slow-down typical when comparing a fuse-mounted
filesystem versus an NFS-mounted filesystem, or should we also be looking
for additional issues? (Note the first dd test described below was run
on the server that hosts the file-systems so no network communication
was involved).

You also mention tweaking the "write-behind xlator settings". Would you
expect better speed improvements from switching the mounting from fuse
to gnfs or from tweaking the settings? Also, are these mutually
exclusive or would there be additional benefits from both switching to
gnfs and tweaking?

My next question is to make sure I'm clear on the comment "if the
gluster node containing the gnfs server goes down, all mounts done using
that node will fail". If you have 2 servers, each hosting 1 brick in the
overall gluster FS, and one server fails, then for gnfs nothing on
either server is visible to other nodes, while under fuse only the files
on the dead server are not visible. Is this what you meant?

Finally, you mention "even for gnfs mounts, you can achieve fail-over by
using CTDB". Do you know if CTDB would have any performance impact
(i.e. in a worst case scenario could adding CTDB to gnfs erase the speed
benefits of going to gnfs in the first place)?

Thanks

Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps,
you could try mounting it via gluster NFS (gnfs) and then see if there
is an improvement in speed. Fuse mounts are slower than gnfs mounts
but you get the benefit of avoiding a single point of failure. Unlike
fuse mounts, if the gluster node containing the gnfs server goes down,
all mounts done using that node will fail). For fuse mounts, you could
try tweaking the write-behind xlator settings to see if it helps. See
the performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts,
you can achieve fail-over by using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ravishankar N
2017-04-11 04:21:21 UTC
Post by Pat Haley
Hi Ravi,
Thanks for the reply. And yes, we are using the gluster native (fuse)
mount. Since this is not my area of expertise I have a few questions
(mostly clarifications)
Is a factor of 20 slow-down typical when compare a fuse-mounted
filesytem versus an NFS-mounted filesystem or should we also be
looking for additional issues? (Note the first dd test described
below was run on the server that hosts the file-systems so no network
communication was involved).
Though both the gluster bricks and the mounts are on the same physical
machine in your setup, the I/O still passes through the different layers
of the kernel/user-space fuse stack, although I don't know if a 20x slow
down on gluster vs. an NFS share is normal. Why don't you try doing a
gluster NFS mount on the machine, run the dd test, and compare it with
the gluster fuse mount results?
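Something roughly like the following should do for the comparison (the
mount point is a placeholder; gluster NFS serves NFSv3, and depending on
your setup you may need the nolock option):

# mount -t nfs -o vers=3,nolock,tcp localhost:/<volname> /mnt/gnfs-test
# dd if=/dev/zero of=/mnt/gnfs-test/zero-gnfs bs=1M count=1000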
Post by Pat Haley
You also mention tweaking " write-behind xlator settings". Would you
expect better speed improvements from switching the mounting from fuse
to gnfs or from tweaking the settings? Also are these mutually
exclusive or would the be additional benefits from both switching to
gfns and tweaking?
You should test these out and find the answers yourself. :-)
Post by Pat Haley
My next question is to make sure I'm clear on the comment " if the
gluster node containing the gnfs server goes down, all mounts done
using that node will fail". If you have 2 servers, each 1 brick in
the over-all gluster FS, and one server fails, then for gnfs nothing
on either server is visible to other nodes while under fuse only the
files on the dead server are not visible. Is this what you meant?
Yes, for gnfs mounts, all I/O from the various mounts goes to the gnfs server
process (on the machine whose IP was used at the time of mounting), which
then sends the I/O to the brick processes. For fuse, the gluster fuse
mount itself talks directly to the bricks.
Post by Pat Haley
Finally, you mention "even for gnfs mounts, you can achieve fail-over
by using CTDB". Do you know if CTDB would have any performance impact
(i.e. in a worst cast scenario could adding CTDB to gnfs erase the
speed benefits of going to gnfs in the first place)?
I don't think it would. You can even achieve load balancing via CTDB to
use different gnfs servers for different clients. But I don't know if
this is needed/helpful in your current setup, where everything (bricks
and clients) seems to be on just one server.

-Ravi
Post by Pat Haley
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps,
you could try mounting it via gluster NFS (gnfs) and then see if
there is an improvement in speed. Fuse mounts are slower than gnfs
mounts but you get the benefit of avoiding a single point of failure.
Unlike fuse mounts, if the gluster node containing the gnfs server
goes down, all mounts done using that node will fail). For fuse
mounts, you could try tweaking the write-behind xlator settings to
see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster volume set
help`. Of course, even for gnfs mounts, you can achieve fail-over by
using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card,
/mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-04-13 22:18:30 UTC
Hi Ravi (and list),

We are planning on testing the NFS route to see what kind of speed-up we
get. A little research led us to the following:

https://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/

Is this the correct path to take to mount 2 xfs volumes as a single gluster
file system volume? If not, what would be a better path?
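For context, the existing setup we want to export is the plain 2-brick
distribute volume, i.e. something that would have been created roughly
along these lines (names are placeholders, just to illustrate the layout;
we would not re-run this):

# gluster volume create <volname> <server>:/mnt/brick1 <server>:/mnt/brick2
# gluster volume start <volname>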


Pat
Post by Ravishankar N
Post by Pat Haley
Hi Ravi,
Thanks for the reply. And yes, we are using the gluster native
(fuse) mount. Since this is not my area of expertise I have a few
questions (mostly clarifications)
Is a factor of 20 slow-down typical when compare a fuse-mounted
filesytem versus an NFS-mounted filesystem or should we also be
looking for additional issues? (Note the first dd test described
below was run on the server that hosts the file-systems so no network
communication was involved).
Though both the gluster bricks and the mounts are on the same physical
machine in your setup, the I/O still passes through different layers
of kernel/user-space fuse stack although I don't know if 20x slow down
on gluster vs NFS share is normal. Why don't you try doing a gluster
NFS mount on the machine and try the dd test and compare it with the
gluster fuse mount results?
Post by Pat Haley
You also mention tweaking " write-behind xlator settings". Would you
expect better speed improvements from switching the mounting from
fuse to gnfs or from tweaking the settings? Also are these mutually
exclusive or would the be additional benefits from both switching to
gfns and tweaking?
You should test these out and find the answers yourself. :-)
Post by Pat Haley
My next question is to make sure I'm clear on the comment " if the
gluster node containing the gnfs server goes down, all mounts done
using that node will fail". If you have 2 servers, each 1 brick in
the over-all gluster FS, and one server fails, then for gnfs nothing
on either server is visible to other nodes while under fuse only the
files on the dead server are not visible. Is this what you meant?
Yes, for gnfs mounts, all I/O from various mounts go to the gnfs
server process (on the machine whose IP was used at the time of
mounting) which then sends the I/O to the brick processes. For fuse,
the gluster fuse mount itself talks directly to the bricks.
Post by Pat Haley
Finally, you mention "even for gnfs mounts, you can achieve fail-over
by using CTDB". Do you know if CTDB would have any performance
impact (i.e. in a worst cast scenario could adding CTDB to gnfs erase
the speed benefits of going to gnfs in the first place)?
I don't think it would. You can even achieve load balancing via CTDB
to use different gnfs servers for different clients. But I don't know
if this is needed/ helpful in your current setup where everything
(bricks and clients) seem to be on just one server.
-Ravi
Post by Pat Haley
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps,
you could try mounting it via gluster NFS (gnfs) and then see if
there is an improvement in speed. Fuse mounts are slower than gnfs
mounts but you get the benefit of avoiding a single point of
failure. Unlike fuse mounts, if the gluster node containing the gnfs
server goes down, all mounts done using that node will fail). For
fuse mounts, you could try tweaking the write-behind xlator settings
to see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster volume set
help`. Of course, even for gnfs mounts, you can achieve fail-over by
using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ravishankar N
2017-04-14 04:57:22 UTC
I'm not sure if the version you are running (glusterfs 3.7.11) works
with NFS-Ganesha, as the link seems to suggest version >=3.8 as a
pre-requisite. Adding Soumya for help. If it is not supported, then you
might have to go the plain gluster NFS (gNFS) way.
Regards,
Ravi
Post by Pat Haley
Hi Ravi (and list),
We are planning on testing the NFS route to see what kind of speed-up
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/
Is this correct path to take to mount 2 xfs volumes as a single
gluster file system volume? If not, what would be a better path?
Pat
Post by Ravishankar N
Post by Pat Haley
Hi Ravi,
Thanks for the reply. And yes, we are using the gluster native
(fuse) mount. Since this is not my area of expertise I have a few
questions (mostly clarifications)
Is a factor of 20 slow-down typical when compare a fuse-mounted
filesytem versus an NFS-mounted filesystem or should we also be
looking for additional issues? (Note the first dd test described
below was run on the server that hosts the file-systems so no
network communication was involved).
Though both the gluster bricks and the mounts are on the same
physical machine in your setup, the I/O still passes through
different layers of kernel/user-space fuse stack although I don't
know if 20x slow down on gluster vs NFS share is normal. Why don't
you try doing a gluster NFS mount on the machine and try the dd test
and compare it with the gluster fuse mount results?
Post by Pat Haley
You also mention tweaking " write-behind xlator settings". Would you
expect better speed improvements from switching the mounting from
fuse to gnfs or from tweaking the settings? Also are these mutually
exclusive or would the be additional benefits from both switching to
gfns and tweaking?
You should test these out and find the answers yourself. :-)
Post by Pat Haley
My next question is to make sure I'm clear on the comment " if the
gluster node containing the gnfs server goes down, all mounts done
using that node will fail". If you have 2 servers, each 1 brick in
the over-all gluster FS, and one server fails, then for gnfs nothing
on either server is visible to other nodes while under fuse only the
files on the dead server are not visible. Is this what you meant?
Yes, for gnfs mounts, all I/O from various mounts go to the gnfs
server process (on the machine whose IP was used at the time of
mounting) which then sends the I/O to the brick processes. For fuse,
the gluster fuse mount itself talks directly to the bricks.
Post by Pat Haley
Finally, you mention "even for gnfs mounts, you can achieve
fail-over by using CTDB". Do you know if CTDB would have any
performance impact (i.e. in a worst cast scenario could adding CTDB
to gnfs erase the speed benefits of going to gnfs in the first place)?
I don't think it would. You can even achieve load balancing via CTDB
to use different gnfs servers for different clients. But I don't know
if this is needed/ helpful in your current setup where everything
(bricks and clients) seem to be on just one server.
-Ravi
Post by Pat Haley
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single point
of failure. Unlike fuse mounts, if the gluster node containing the
gnfs server goes down, all mounts done using that node will fail).
For fuse mounts, you could try tweaking the write-behind xlator
settings to see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster volume set
help`. Of course, even for gnfs mounts, you can achieve fail-over
by using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-04-17 07:18:41 UTC
Post by Ravishankar N
I'm not sure if the version you are running (glusterfs 3.7.11 ) works
with NFS-Ganesha as the link seems to suggest version >=3.8 as a
per-requisite. Adding Soumya for help. If it is not supported, then you
might have to go the plain glusterNFS way.
Even gluster 3.7.x should work with NFS-Ganesha, but the steps to
configure it changed from 3.8 onwards, hence the pre-requisite was added
in the doc. IIUC from your mail below, you would like to try NFS
(preferably gNFS rather than NFS-Ganesha), which may perform better
compared to a fuse mount. In that case, the gNFS server comes up by
default (till release-3.7.x) and there are additional steps needed to
export the volume via gNFS. Let me know if you have any issues accessing
volumes via gNFS.
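Roughly, the steps would be along these lines (volume name and mount
point are placeholders; on 3.7.x the gNFS server should already be
running unless nfs.disable was turned on for the volume):

# gluster volume set <volname> nfs.disable off
# gluster volume status <volname>    (the "NFS Server on localhost" entry should show Online)
# showmount -e <server>
# mount -t nfs -o vers=3 <server>:/<volname> /mnt/gnfs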

Regards,
Soumya
Post by Ravishankar N
Regards,
Ravi
Post by Pat Haley
Hi Ravi (and list),
We are planning on testing the NFS route to see what kind of speed-up
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/
Is this correct path to take to mount 2 xfs volumes as a single
gluster file system volume? If not, what would be a better path?
Pat
Post by Ravishankar N
Post by Pat Haley
Hi Ravi,
Thanks for the reply. And yes, we are using the gluster native
(fuse) mount. Since this is not my area of expertise I have a few
questions (mostly clarifications)
Is a factor of 20 slow-down typical when compare a fuse-mounted
filesytem versus an NFS-mounted filesystem or should we also be
looking for additional issues? (Note the first dd test described
below was run on the server that hosts the file-systems so no
network communication was involved).
Though both the gluster bricks and the mounts are on the same
physical machine in your setup, the I/O still passes through
different layers of kernel/user-space fuse stack although I don't
know if 20x slow down on gluster vs NFS share is normal. Why don't
you try doing a gluster NFS mount on the machine and try the dd test
and compare it with the gluster fuse mount results?
Post by Pat Haley
You also mention tweaking " write-behind xlator settings". Would
you expect better speed improvements from switching the mounting
from fuse to gnfs or from tweaking the settings? Also are these
mutually exclusive or would the be additional benefits from both
switching to gfns and tweaking?
You should test these out and find the answers yourself. :-)
Post by Pat Haley
My next question is to make sure I'm clear on the comment " if the
gluster node containing the gnfs server goes down, all mounts done
using that node will fail". If you have 2 servers, each 1 brick in
the over-all gluster FS, and one server fails, then for gnfs nothing
on either server is visible to other nodes while under fuse only the
files on the dead server are not visible. Is this what you meant?
Yes, for gnfs mounts, all I/O from various mounts go to the gnfs
server process (on the machine whose IP was used at the time of
mounting) which then sends the I/O to the brick processes. For fuse,
the gluster fuse mount itself talks directly to the bricks.
Post by Pat Haley
Finally, you mention "even for gnfs mounts, you can achieve
fail-over by using CTDB". Do you know if CTDB would have any
performance impact (i.e. in a worst cast scenario could adding CTDB
to gnfs erase the speed benefits of going to gnfs in the first place)?
I don't think it would. You can even achieve load balancing via CTDB
to use different gnfs servers for different clients. But I don't know
if this is needed/ helpful in your current setup where everything
(bricks and clients) seem to be on just one server.
-Ravi
Post by Pat Haley
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single point
of failure. Unlike fuse mounts, if the gluster node containing the
gnfs server goes down, all mounts done using that node will fail).
For fuse mounts, you could try tweaking the write-behind xlator
settings to see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster volume set
help`. Of course, even for gnfs mounts, you can achieve fail-over
by using CTDB.
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after changing
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-04-14 06:50:54 UTC
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts but you get
the benefit of avoiding a single point of failure. Unlike fuse mounts, if
the gluster node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size options
in `gluster volume set help`. Of course, even for gnfs mounts, you can
achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than gNFS
servers?

Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?

You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
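In short, with the volume name as a placeholder:

# gluster volume profile <volname> start
# gluster volume profile <volname> info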
Post by Ravishankar N
Thanks,
Ravi
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
- on NFS disk (/home): 9.5 Gb/s
- on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or anything
- one server with 70 hard disks and a hardware RAID card.
- 4 disks in a RAID-6 group (the NFS disk)
- 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
- 32 disks in another RAID-6 group (/mnt/brick2)
- 2 hot spare
Some additional information and more tests results (after changing the log
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast
gdata]# dd if=/gdata/zero1 of=/gdata/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time (gluster to gluster)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint (/mnt/brick1)
and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Pranith
Ravishankar N
2017-04-14 07:01:41 UTC
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single
point of failure. Unlike fuse mounts, if the gluster node
containing the gnfs server goes down, all mounts done using that
node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs
mounts, you can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than
gNFS servers?
I have heard anecdotal evidence time and again on the ML and IRC, which
is why I wanted to compare it with NFS numbers on his setup.
Post by Ravishankar N
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
Yeah, let's see if the profile info turns up anything interesting.
-Ravi
Post by Ravishankar N
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk
when compared to writing to an NFS disk. Specifically when using
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3
3108 [Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
<http://lists.gluster.org/mailman/listinfo/gluster-users>
_______________________________________________ Gluster-users
http://lists.gluster.org/mailman/listinfo/gluster-users
<http://lists.gluster.org/mailman/listinfo/gluster-users>
--
Pranith
Pat Haley
2017-05-05 14:44:30 UTC
Hi Pranith & Ravi,

A couple of quick questions

We have profile turned on. Are there specific queries we should make
that would help debug our configuration? (The default profile info was
previously sent in
http://lists.gluster.org/pipermail/gluster-users/2017-May/030840.html
but I'm not sure if that is what you were looking for.)

We also started to do a test on serving gluster over NFS. We
rediscovered an issue we previously reported (
http://lists.gluster.org/pipermail/gluster-users/2016-September/028289.html
) in that the NFS-mounted version was ignoring the group write
permissions. What specific information would be useful in debugging this?

Thanks

Pat
Post by Ravishankar N
On Sat, Apr 8, 2017 at 10:28 AM, Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single
point of failure. Unlike fuse mounts, if the gluster node
containing the gnfs server goes down, all mounts done using that
node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs
mounts, you can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than
gNFS servers?
I have heard anecdotal evidence time and again on the ML and IRC,
which is why I wanted to compare it with NFS numbers on his setup.
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
Yeah, Let's see if profile info shows up anything interesting.
-Ravi
Thanks,
Ravi
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk
when compared to writing to an NFS disk. Specifically when using
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3
3108 [Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
<http://lists.gluster.org/mailman/listinfo/gluster-users>
_______________________________________________ Gluster-users
http://lists.gluster.org/mailman/listinfo/gluster-users
<http://lists.gluster.org/mailman/listinfo/gluster-users>
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-05 14:58:21 UTC
Hi Pat,
Let us concentrate on the performance numbers part for now. We will
look at the permissions issue after this?

As per the profile info, only 2.6% of the workload is writes. There are
too many Lookups.

Would it be possible to get the data for just the dd test you were doing
earlier?
Post by Pat Haley
Hi Pranith & Ravi,
A couple of quick questions
We have profile turned on. Are there specific queries we should make that
would help debug our configuration? (The default profile info was
previously sent in http://lists.gluster.org/pipermail/gluster-users/2017-
May/030840.html but I'm not sure if that is what you were looking for.)
We also started to do a test on serving gluster over NFS. We rediscovered
an issue we previously reported ( http://lists.gluster.org/
pipermail/gluster-users/2016-September/028289.html ) in that the NFS
mounted version was ignoring the group write permissions. What specific
information would be useful in debugging this?
Thanks
Pat
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts but you get
the benefit of avoiding a single point of failure. Unlike fuse mounts, if
the gluster node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts, you
can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than gNFS
servers?
I have heard anecdotal evidence time and again on the ML and IRC, which is
why I wanted to compare it with NFS numbers on his setup.
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
You can follow https://gluster.readthedocs.io/en/latest/Administrator%
20Guide/Monitoring%20Workload/ to get the information.
Yeah, Let's see if profile info shows up anything interesting.
-Ravi
Post by Ravishankar N
Thanks,
Ravi
Hi,
We noticed a dramatic slowness when writing to a gluster disk when
compared to writing to an NFS disk. Specifically when using dd (data
- on NFS disk (/home): 9.5 Gb/s
- on gluster disk (/gdata): 508 Mb/s
The gluser disk is 2 bricks joined together, no replication or anything
- one server with 70 hard disks and a hardware RAID card.
- 4 disks in a RAID-6 group (the NFS disk)
- 32 disks in a RAID-6 group (the max allowed by the card,
/mnt/brick1)
- 32 disks in another RAID-6 group (/mnt/brick2)
- 2 hot spare
Some additional information and more tests results (after changing the
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108
[Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast
gdata]# dd if=/gdata/zero1 of=/gdata/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* - realllyyy
slooowww
*Copy from /gdata to /gdata* *2nd time (gluster to gluster)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* - realllyyy
slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint (/mnt/brick1)
and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
_______________________________________________ Gluster-users mailing
an/listinfo/gluster-users
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
Pat Haley
2017-05-05 15:12:31 UTC
Hi Pranith,

I presume you are asking for some version of the profile data that just
shows the dd test (or a repeat of the dd test). If yes, how do I
extract just that data?

Thanks

Pat
Post by Pranith Kumar Karampuri
hi Pat,
Let us concentrate on the performance numbers part for now. We
will look at the permissions one after this?
As per the profile info, only 2.6% of the work-load is writes. There
are too many Lookups.
Would it be possible to get the data for just the dd test you were
doing earlier?
Hi Pranith & Ravi,
A couple of quick questions
We have profile turned on. Are there specific queries we should
make that would help debug our configuration? (The default
profile info was previously sent in
http://lists.gluster.org/pipermail/gluster-users/2017-May/030840.html
<http://lists.gluster.org/pipermail/gluster-users/2017-May/030840.html>
but I'm not sure if that is what you were looking for.)
We also started to do a test on serving gluster over NFS. We
rediscovered an issue we previously reported (
http://lists.gluster.org/pipermail/gluster-users/2016-September/028289.html
<http://lists.gluster.org/pipermail/gluster-users/2016-September/028289.html>
) in that the NFS mounted version was ignoring the group write
permissions. What specific information would be useful in
debugging this?
Thanks
Pat
Post by Ravishankar N
On Sat, Apr 8, 2017 at 10:28 AM, Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If
it helps, you could try mounting it via gluster NFS (gnfs)
and then see if there is an improvement in speed. Fuse
mounts are slower than gnfs mounts but you get the benefit
of avoiding a single point of failure. Unlike fuse mounts,
if the gluster node containing the gnfs server goes down,
all mounts done using that node will fail). For fuse mounts,
you could try tweaking the write-behind xlator settings to
see if it helps. See the performance.write-behind and
performance.write-behind-window-size options in `gluster
volume set help`. Of course, even for gnfs mounts, you can
achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower
than gNFS servers?
I have heard anecdotal evidence time and again on the ML and IRC,
which is why I wanted to compare it with NFS numbers on his setup.
Pat,
I see that I am late to the thread, but do you happen to
have "profile info" of the workload?
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
<https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/>
to get the information.
Yeah, Let's see if profile info shows up anything interesting.
-Ravi
Thanks,
Ravi
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ravishankar N
2017-05-05 16:47:23 UTC
Permalink
Post by Pat Haley
Hi Pranith,
I presume you are asking for some version of the profile data that
just shows the dd test (or a repeat of the dd test). If yes, how do I
extract just that data?
Yes, that is what he is asking for. Just clear the existing profile info
using `gluster volume profile volname clear` and run the dd test once.
Then when you run profile info again, it should just give you the stats
for the dd test.
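For example, something along these lines should capture it (assuming the
volume is named data-volume and is fuse-mounted at /gdata; adjust the volume
name, mount point and test file to your setup):

gluster volume profile data-volume clear
dd if=/dev/zero of=/gdata/profile-test.out bs=1M count=1000
gluster volume profile data-volume info > profile_dd_test.txt
rm /gdata/profile-test.out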
Pat Haley
2017-05-06 00:11:26 UTC
Permalink
Hi,

We redid the dd tests (this time using conv=sync oflag=sync to avoid
caching questions). The profile results are in

http://mseas.mit.edu/download/phaley/GlusterUsers/profile_gluster_fuse_test
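For concreteness, the dd invocation was roughly of this form (the output file
name here is just an example; block size and count as in the earlier tests):

dd if=/dev/zero of=/gdata/ddtest.out bs=1M count=1000 conv=sync oflag=sync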
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-05-10 14:32:48 UTC
Permalink
Hi,

We finally managed to do the dd tests for an NFS-mounted gluster file
system. The profile results during that test are in

http://mseas.mit.edu/download/phaley/GlusterUsers/profile_gluster_nfs_test

The summary of the dd tests is:

* writing to gluster disk mounted with fuse: 5 Mb/s
* writing to gluster disk mounted with nfs: 200 Mb/s

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-10 15:44:04 UTC
Permalink
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info

Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

I copied this from an old thread from 2016. This is a distribute volume. Did
you change any of the options in between?
Pat Haley
2017-05-10 15:47:17 UTC
Permalink
Here is what I see now:

[***@mseas-data2 ~]# gluster volume info

Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Post by Pranith Kumar Karampuri
Is this the volume info you have?
I copied this from an old thread from 2016. This is a distribute volume. Did
you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-10 15:53:55 UTC
Permalink
Could you let me know the speed without oflag=sync on both the mounts? No
need to collect profiles.
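That is, something like the following on each mount (the two paths are just
examples standing in for the fuse mount point and the gluster NFS mount point):

dd if=/dev/zero of=/gdata/ddtest.out bs=1M count=4096
dd if=/dev/zero of=/gdata-nfs/ddtest.out bs=1M count=4096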
--
Pranith
Pat Haley
2017-05-10 16:05:25 UTC
Permalink
Without the oflag=sync, and with only a single test of each, the FUSE mount is
going faster than NFS:

FUSE:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s


NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-10 16:15:44 UTC
Permalink
Okay, good. At least this validates my doubts. Handling O_SYNC in gluster
NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write
syscall has to be written to disk as part of that syscall, whereas in the
case of NFS there is no concept of open. NFS performs the write through a
handle that says it needs to be a synchronous write, so the write() syscall
is performed first and then an fsync() is performed; a write on an fd opened
with O_SYNC effectively becomes write+fsync. My guess is that when multiple
threads do this write+fsync() operation on the same file, multiple writes
get batched together before being written to disk, so the throughput on the
disk increases.
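A rough way to see the two code paths from the client side is with dd itself
(this is only an analogy, not exactly what gNFS does internally; the path and
file name below are placeholders):

dd if=/dev/zero of=/gdata/synctest.out bs=1M count=1000 oflag=sync  # output file opened with O_SYNC: each write must reach disk before dd continues
dd if=/dev/zero of=/gdata/synctest.out bs=1M count=1000 conv=fsync  # buffered writes, with a single fsync before dd exits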

Does it answer your doubts?
--
Pranith
Pat Haley
2017-05-10 16:45:04 UTC
Permalink
Hi Pranith,

Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.

I am also uncertain about how to interpret the results when we also add
the dd tests writing to the /home area (no gluster, still on the same
machine)

* dd test without oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s

Given that the non-gluster area is a RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of 32 disks, I would naively expect the
writes to the gluster area to be roughly 8x faster than to the non-gluster area.

I still think we have a speed issue; I can't tell if fuse vs nfs is part
of the problem. Was there anything useful in the profiles?

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-10 17:27:46 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your answer
by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also add
the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each brick of
the gluster area is a RAID-6 of 32 disks, I would naively expect the writes
to the gluster area to be roughly 8x faster than to the non-gluster.
I think a better test is to try to write a file using nfs without any
gluster, to a location that is not inside the brick but some other location
that is on the same disk(s). If you are mounting the partition as the brick,
then we can write to a file inside the .glusterfs directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
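For example, something like this run against the brick filesystem (taking
/mnt/brick1 from your volume info; the file name is just an example, and the
file should be removed afterwards):

dd if=/dev/zero of=/mnt/brick1/.glusterfs/ddtest-remove-me bs=1M count=4096
rm /mnt/brick1/.glusterfs/ddtest-remove-me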
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I read that fuse is slower than nfs,
which is counter-intuitive to my understanding, so I wanted clarification.
Now that I have that clarification (fuse outperformed nfs without sync), we
can resume testing as described above and try to find what the problem is.
Based on your email id I am guessing you are in Boston and I am in Bangalore,
so if you are okay with doing this debugging over multiple days because of
the timezones, I will be happy to help. Please be a bit patient with me; I am
under a release crunch, but I am very curious about the problem you posted.

Was there anything useful in the profiles?
Unfortunately the profiles didn't help me much. I think we are collecting the
profiles from an active volume, so they contain a lot of information that
does not pertain to dd, which makes it difficult to isolate dd's
contribution. So I went through your post again and found something I hadn't
paid much attention to earlier, i.e. oflag=sync, did my own tests on my setup
with FUSE, and sent that reply.
--
Pranith
Pat Haley
2017-05-10 21:18:26 UTC
Permalink
Hi Pranith,

Since we are mounting the partitions as the bricks, I tried the dd test
writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The
results without oflag=sync were 1.6 Gb/s (faster than gluster but not as
fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-11 11:05:41 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test
writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The
results without oflag=sync were 1.6 Gb/s (faster than gluster but not as
fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer
disks).
Okay, then 1.6 Gb/s is what we need to target, considering your volume
is just distribute. Is there any way you can do tests on similar hardware
but at a small scale, just so we can run the workload and learn more about
the bottlenecks in the system? We can probably try to get the speed to the
1.2 Gb/s of the /home partition you were telling me about yesterday. Let me
know if that is something you are okay to do.
--
Pranith
Pat Haley
2017-05-11 15:27:44 UTC
Permalink
Hi Pranith,

Unfortunately, we don't have similar hardware for a small scale test.
All we have is our production hardware.

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-11 15:32:16 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test. All
we have is our production hardware.
You said something about the /home partition, which has fewer disks; we can
create a plain distribute volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
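As a rough sketch of what I mean (the volume name, brick directory and mount
point below are only placeholders, pick whatever suits you; gluster may ask
for 'force' at the end of the create command depending on where the brick
directory sits):

mkdir -p /home/gluster-test/brick1          # temporary brick directory on the /home partition
gluster volume create home-test mseas-data2:/home/gluster-test/brick1
gluster volume start home-test
mkdir -p /mnt/home-test
mount -t glusterfs mseas-data2:/home-test /mnt/home-test   # fuse mount to run the dd tests against

and once we are done:

umount /mnt/home-test
gluster volume stop home-test
gluster volume delete home-test
rm -rf /home/gluster-test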
--
Pranith
Pat Haley
2017-05-11 16:02:38 UTC
Permalink
Hi Pranith,

The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2

The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
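In case it is useful, the same information can be read off the live mounts
(paths are just the ones from the fstab lines above):

df -T /home /mnt/brick1 /mnt/brick2
grep -E 'home|brick' /proc/mounts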

Will this cause a problem with creating a volume under /home?

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-11 16:06:14 UTC
Permalink
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is the disk. Can you run the same tests you did
earlier on your new volume to confirm?
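For example, once the test volume from the last mail is mounted (names below
are the same placeholders as before):

dd if=/dev/zero of=/mnt/home-test/zeros.txt bs=1048576 count=4096
dd if=/dev/zero of=/mnt/home-test/zeros2.txt bs=1048576 count=4096 oflag=sync

and then the same two runs against a plain directory on /home, so we can
compare the gluster-on-/home numbers with the raw /home numbers.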
--
Pranith
Pat Haley
2017-05-12 14:34:04 UTC
Permalink
Hi Pranith,

My question was about setting up a gluster volume on an ext4 partition.
I thought we had the bricks mounted as xfs for compatibility with gluster?

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-13 03:14:12 UTC
Permalink
Post by Pat Haley
Hi Pranith,
My question was about setting up a gluster volume on an ext4 partition. I
thought we had the bricks mounted as xfs for compatibility with gluster?
Oh that should not be a problem. It works fine.
--
Pranith
Pranith Kumar Karampuri
2017-05-13 03:17:11 UTC
Permalink
On Sat, May 13, 2017 at 8:44 AM, Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Pat Haley
Hi Pranith,
My question was about setting up a gluster volume on an ext4 partition.
I thought we had the bricks mounted as xfs for compatibility with gluster?
Oh that should not be a problem. It works fine.
Just that xfs doesn't really have limits for anything, whereas ext4 does for
things like hard links etc. (at least the last time I checked :-) ). So it is
better to have xfs.
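For instance, the per-inode hard-link limit each filesystem advertises can be
read with getconf (the exact numbers are whatever your kernel reports for
those mounts):

getconf LINK_MAX /home          # ext4
getconf LINK_MAX /mnt/brick1    # xfs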
--
Pranith
Ben Turner
2017-05-15 01:24:53 UTC
Permalink
----- Original Message -----
Sent: Friday, May 12, 2017 11:17:11 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
On Sat, May 13, 2017 at 8:44 AM, Pranith Kumar Karampuri <
Hi Pranith,
My question was about setting up a gluster volume on an ext4 partition. I
thought we had the bricks mounted as xfs for compatibility with gluster?
Oh that should not be a problem. It works fine.
Just that xfs doesn't really have limits for anything, whereas ext4 does for things
like hard links etc. (at least the last time I checked :-) ). So it is better to
have xfs.
One of the biggest reasons to use XFS, IMHO, is that most of the testing, large-scale deployments (at least the ones I know of), etc. are done using XFS as the backend. While EXT4 should work, I don't think it has the same level of testing as XFS.

-b
Pat
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did on
your new volume to confirm?
Pat
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test. All we
have is our production hardware.
You said something about /home partition which has lesser disks, we can
create plain distribute volume inside one of those directories. After we are
done, we can remove the setup. What do you say?
Pat
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test
writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The
results without oflag=sync were 1.6 Gb/s (faster than gluster but not as
fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer
disks).
Okay, then 1.6Gb/s is what we need to target for, considering your volume is
just distribute. Is there any way you can do tests on similar hardware but
at a small scale? Just so we can run the workload to learn more about the
bottlenecks in the system? We can probably try to get the speed to 1.2Gb/s
on your /home partition you were telling me yesterday. Let me know if that
is something you are okay to do.
Pat
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your answer by
some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also add the
dd tests writing to the /home area (no gluster, still on the same machine)
* dd test without oflag=sync (rough average of multiple tests)
* gluster w/ fuse mount : 570 Mb/s
* gluster w/ nfs mount: 390 Mb/s
* nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
* gluster w/ fuse mount: 5 Mb/s
* gluster w/ nfs mount: 200 Mb/s
* nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each brick of
the gluster area is a RAID-6 of 32 disks, I would naively expect the writes
to the gluster area to be roughly 8x faster than to the non-gluster.
I think a better test is to try and write to a file using nfs without any
gluster to a location that is not inside the brick but someother location
that is on same disk(s). If you are mounting the partition as the brick,
then we can write to a file inside .glusterfs directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't tell if fuse vs nfs is part of
the problem.
I got interested in the post because I read that fuse speed is lesser than
nfs speed which is counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications where fuse outperformed nfs
without sync, we can resume testing as described above and try to find what
it is. Based on your email-id I am guessing you are from Boston and I am
from Bangalore so if you are okay with doing this debugging for multiple
days because of timezones, I will be happy to help. Please be a bit patient
with me, I am under a release crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting the
profiles from an active volume, so it has a lot of information that is not
pertaining to dd so it is difficult to find the contributions of dd. So I
went through your post again and found something I didn't pay much attention
to earlier i.e. oflag=sync, so did my own tests on my setup with FUSE so
sent that reply.
Pat
Okay good. At least this validates my doubts. Handling O_SYNC in gluster NFS
and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write
syscall has to be written to disk as part of the syscall, whereas in the case
of NFS there is no concept of open. NFS performs the write through a handle
saying it needs to be a synchronous write, so the write() syscall is performed
first and then it performs fsync(). So a write on an fd with O_SYNC becomes
write+fsync. I am suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple writes are batched
together before being written to disk, so my guess is that this is why the
throughput on the disk increases.
Does it answer your doubts?
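(A rough way to see the two behaviours with dd itself, the file name being only a
placeholder:

dd if=/dev/zero of=ddtest.tmp bs=1048576 count=1024 oflag=sync   # output opened with O_SYNC, every write must hit disk
dd if=/dev/zero of=ddtest.tmp bs=1048576 count=1024 conv=fsync   # buffered writes, one fsync() before dd exits

oflag=sync pays the synchronous cost on every single write, which matches the very
low fuse numbers above; conv=fsync defers the flush to the end, so it is not an exact
reproduction of the per-write write+fsync that gnfs does, just an illustration of the
difference.)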
Without the oflag=sync and only a single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the mounts? No
need to collect profiles.
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute volume. Did you
change any of the options in between?
--
Pranith
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pat Haley
2017-05-16 15:50:35 UTC
Permalink
Hi Pranith,

Sorry for the delay. I never received your reply (but I did receive
Ben Turner's follow-up to it). So we tried to create a gluster
volume under /home using different variations of

gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp

However we keep getting errors of the form

Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>

Any thoughts on what we're doing wrong?

Also, do you have a list of the tests we should run once we get
this volume created? Given the time-zone difference it might help if we
can run a small battery of tests and post all the results at once, rather
than going test-post, new-test-post... .

Thanks

Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you
did on your new volume to confirm?
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small
scale test. All we have is our production hardware.
You said something about /home partition which has lesser disks,
we can create plain distribute volume inside one of those
directories. After we are done, we can remove the setup. What do
you say?
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I
tried the dd test writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster
than gluster but not as fast as I was expecting given
the 1.2 Gb/s to the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for,
considering your volume is just distribute. Is there any way
you can do tests on similar hardware but at a small scale?
Just so we can run the workload to learn more about the
bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me
yesterday. Let me know if that is something you are okay to do.
Pat
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of
expertise). I'll run your answer by some other
people who are more familiar with this.
I am also uncertain about how to interpret the
results when we also add the dd tests writing to
the /home area (no gluster, still on the same machine)
* dd test without oflag=sync (rough average of
multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of
multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4
disks while each brick of the gluster area is a
RAID-6 of 32 disks, I would naively expect the
writes to the gluster area to be roughly 8x faster
than to the non-gluster.
I think a better test is to try and write to a file
using nfs without any gluster to a location that is not
inside the brick but someother location that is on same
disk(s). If you are mounting the partition as the
brick, then we can write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't tell
if fuse vs nfs is part of the problem.
I got interested in the post because I read that fuse
speed is lesser than nfs speed which is
counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications where
fuse outperformed nfs without sync, we can resume
testing as described above and try to find what it is.
Based on your email-id I am guessing you are from
Boston and I am from Bangalore so if you are okay with
doing this debugging for multiple days because of
timezones, I will be happy to help. Please be a bit
patient with me, I am under a release crunch but I am
very curious with the problem you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we
are collecting the profiles from an active volume, so
it has a lot of information that is not pertaining to
dd so it is difficult to find the contributions of dd.
So I went through your post again and found something I
didn't pay much attention to earlier i.e. oflag=sync,
so did my own tests on my setup with FUSE so sent that
reply.
Pat
Post by Pranith Kumar Karampuri
Okay good. At least this validates my doubts.
Handling O_SYNC in gluster NFS and fuse is a bit
different.
When application opens a file with O_SYNC on fuse
mount then each write syscall has to be written to
disk as part of the syscall where as in case of
NFS, there is no concept of open. NFS performs
write though a handle saying it needs to be a
synchronous write, so write() syscall is performed
first then it performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync. I am
suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple
writes are batched together to be written do disk
so the throughput on the disk is increasing is my
guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM, Pat Haley
Without the oflag=sync and only a single test
mseas-data2(dri_nascar)% dd if=/dev/zero
count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s,
575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s,
376 MB/s
On 05/10/2017 11:53 AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Could you let me know the speed without
oflag=sync on both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17 PM, Pat Haley
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute volume. Did you
change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-17 09:01:04 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive
Ben Turner's follow-up to your reply). So we tried to create a gluster
volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give 'transport tcp' at the beginning, I think. Anyway, transport
tcp is the default, so there is no need to specify it; just remove those two
words from the CLI.
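In other words, something like the following should go through (same bricks you
listed, just without the last two words):

gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
gluster volume start test-volume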
Post by Pat Haley
Also do you have a list of the test we should be running once we get this
volume created? Given the time-zone difference it might help if we can run
a small battery of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis for users, as far as I
remember. In our team there are separate engineers who do these tests; Ben,
who replied earlier, is one such engineer.

Ben,
Have any suggestions?
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did on
your new volume to confirm?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test.
All we have is our production hardware.
You said something about /home partition which has lesser disks, we can
create plain distribute volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test
writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your
volume is just distribute. Is there any way you can do tests on similar
hardware but at a small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let
me know if that is something you are okay to do.
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also
add the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would naively expect
the writes to the gluster area to be roughly 8x faster than to the
non-gluster.
I think a better test is to try and write to a file using nfs without
any gluster to a location that is not inside the brick but someother
location that is on same disk(s). If you are mounting the partition as the
brick, then we can write to a file inside .glusterfs directory, something
like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is
part of the problem.
I got interested in the post because I read that fuse speed is lesser
than nfs speed which is counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications where fuse outperformed
nfs without sync, we can resume testing as described above and try to find
what it is. Based on your email-id I am guessing you are from Boston and I
am from Bangalore so if you are okay with doing this debugging for multiple
days because of timezones, I will be happy to help. Please be a bit patient
with me, I am under a release crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting
the profiles from an active volume, so it has a lot of information that is
not pertaining to dd so it is difficult to find the contributions of dd. So
I went through your post again and found something I didn't pay much
attention to earlier i.e. oflag=sync, so did my own tests on my setup with
FUSE so sent that reply.
Post by Pat Haley
Pat
Okay good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When application opens a file with O_SYNC on fuse mount then each
write syscall has to be written to disk as part of the syscall where as in
case of NFS, there is no concept of open. NFS performs write though a
handle saying it needs to be a synchronous write, so write() syscall is
performed first then it performs fsync(). so an write on an fd with O_SYNC
becomes write+fsync. I am suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple writes are batched
together to be written do disk so the throughput on the disk is increasing
is my guess.
Does it answer your doubts?
Post by Pat Haley
Without the oflag=sync and only a single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the
mounts? No need to collect profiles.
Post by Pranith Kumar Karampuri
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute volume.
Did you change any of the options in between?
--
Pranith
Pat Haley
2017-05-30 15:46:18 UTC
Permalink
Hi Pranith,

Thanks for the tip. We now have the gluster volume mounted under
/home. What tests do you recommend we run?

Thanks

Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did
receive Ben Turner's follow-up to your reply). So we tried to
create a gluster volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways,
transport tcp is the default, so no need to specify so remove those
two words from the CLI.
Also do you have a list of the test we should be running once we
get this volume created? Given the time-zone difference it might
help if we can run a small battery of tests and post the results
rather than test-post-new test-post... .
This is the first time I am doing performance analysis on users as far
as I remember. In our team there are separate engineers who do these
tests. Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests
you did on your new volume to confirm?
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a
small scale test. All we have is our production hardware.
You said something about /home partition which has lesser
disks, we can create plain distribute volume inside one of
those directories. After we are done, we can remove the
setup. What do you say?
Pat
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks,
I tried the dd test writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s
(faster than gluster but not as fast as I was
expecting given the 1.2 Gb/s to the no-gluster area
w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for,
considering your volume is just distribute. Is there
any way you can do tests on similar hardware but at a
small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on your /home
partition you were telling me yesterday. Let me know if
that is something you are okay to do.
Pat
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of
expertise). I'll run your answer by some other
people who are more familiar with this.
I am also uncertain about how to interpret the
results when we also add the dd tests writing
to the /home area (no gluster, still on the
same machine)
* dd test without oflag=sync (rough average
of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of
multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of
4 disks while each brick of the gluster area
is a RAID-6 of 32 disks, I would naively
expect the writes to the gluster area to be
roughly 8x faster than to the non-gluster.
I think a better test is to try and write to a
file using nfs without any gluster to a location
that is not inside the brick but someother
location that is on same disk(s). If you are
mounting the partition as the brick, then we can
write to a file inside .glusterfs directory,
something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't
tell if fuse vs nfs is part of the problem.
I got interested in the post because I read that
fuse speed is lesser than nfs speed which is
counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications
where fuse outperformed nfs without sync, we can
resume testing as described above and try to find
what it is. Based on your email-id I am guessing
you are from Boston and I am from Bangalore so if
you are okay with doing this debugging for
multiple days because of timezones, I will be
happy to help. Please be a bit patient with me, I
am under a release crunch but I am very curious
with the problem you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I
think we are collecting the profiles from an
active volume, so it has a lot of information that
is not pertaining to dd so it is difficult to find
the contributions of dd. So I went through your
post again and found something I didn't pay much
attention to earlier i.e. oflag=sync, so did my
own tests on my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith Kumar
Post by Pranith Kumar Karampuri
Okay good. At least this validates my doubts.
Handling O_SYNC in gluster NFS and fuse is a
bit different.
When application opens a file with O_SYNC on
fuse mount then each write syscall has to be
written to disk as part of the syscall where
as in case of NFS, there is no concept of
open. NFS performs write though a handle
saying it needs to be a synchronous write, so
write() syscall is performed first then it
performs fsync(). so an write on an fd with
O_SYNC becomes write+fsync. I am suspecting
that when multiple threads do this
write+fsync() operation on the same file,
multiple writes are batched together to be
written do disk so the throughput on the disk
is increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM, Pat Haley
Without the oflag=sync and only a single
test of each, the FUSE is going faster
mseas-data2(dri_nascar)% dd if=/dev/zero
count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961
s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero
count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264
s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Could you let me know the speed without
oflag=sync on both the mounts? No need
to collect profiles.
On Wed, May 10, 2017 at 9:17 PM, Pat
info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM, Pranith
Post by Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute volume.
Did you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-30 16:10:56 UTC
Permalink
Let's start with the same 'dd' test we were testing with, to see what the
numbers are. Please provide profile numbers for the same. From there on we
will start tuning the volume to see what we can do.
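For example, something along these lines (test-volume as created earlier; replace
<mount-point> with wherever you mounted it under /home):

gluster volume profile test-volume start
dd if=/dev/zero of=/home/<mount-point>/zeros.txt bs=1048576 count=4096 conv=sync
gluster volume profile test-volume info > profile_testvol_gluster.txt
gluster volume profile test-volume stop

conv=sync and the sizes are the same ones used in the earlier dd runs in this thread.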
Post by Pat Haley
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted under /home.
What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive
Ben Turner's follow-up to your reply). So we tried to create a gluster
volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways, transport
tcp is the default, so no need to specify so remove those two words from
the CLI.
Post by Pat Haley
Also do you have a list of the test we should be running once we get this
volume created? Given the time-zone difference it might help if we can run
a small battery of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on users as far as
I remember. In our team there are separate engineers who do these tests.
Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did
on your new volume to confirm?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test.
All we have is our production hardware.
You said something about /home partition which has lesser disks, we can
create plain distribute volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd
test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your
volume is just distribute. Is there any way you can do tests on similar
hardware but at a small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let
me know if that is something you are okay to do.
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also
add the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would naively expect
the writes to the gluster area to be roughly 8x faster than to the
non-gluster.
I think a better test is to try and write to a file using nfs without
any gluster to a location that is not inside the brick but someother
location that is on same disk(s). If you are mounting the partition as the
brick, then we can write to a file inside .glusterfs directory, something
like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is
part of the problem.
I got interested in the post because I read that fuse speed is lesser
than nfs speed which is counter-intuitive to my understanding. So wanted
clarifications. Now that I got my clarifications where fuse outperformed
nfs without sync, we can resume testing as described above and try to find
what it is. Based on your email-id I am guessing you are from Boston and I
am from Bangalore so if you are okay with doing this debugging for multiple
days because of timezones, I will be happy to help. Please be a bit patient
with me, I am under a release crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting
the profiles from an active volume, so it has a lot of information that is
not pertaining to dd so it is difficult to find the contributions of dd. So
I went through your post again and found something I didn't pay much
attention to earlier i.e. oflag=sync, so did my own tests on my setup with
FUSE so sent that reply.
Post by Pat Haley
Pat
Okay good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When application opens a file with O_SYNC on fuse mount then each
write syscall has to be written to disk as part of the syscall where as in
case of NFS, there is no concept of open. NFS performs write though a
handle saying it needs to be a synchronous write, so write() syscall is
performed first then it performs fsync(). so an write on an fd with O_SYNC
becomes write+fsync. I am suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple writes are batched
together to be written do disk so the throughput on the disk is increasing
is my guess.
Does it answer your doubts?
Post by Pat Haley
Without the oflag=sync and only a single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the
mounts? No need to collect profiles.
Post by Pranith Kumar Karampuri
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread from 2016. This is distribute
volume. Did you change any of the options in between?
--
Pranith
Pat Haley
2017-05-30 17:06:51 UTC
Permalink
Hi Pranith,

I ran the same 'dd' test both in the gluster test volume and in the
.glusterfs directory of each brick. The median results (12 dd trials in
each test) are similar to before

* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s

The profile for the gluster test-volume is in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt

Thanks

Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see, what
the numbers are. Please provide profile numbers for the same. From
there on we will start tuning the volume to see what we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted under
/home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I
did receive Ben Turner's follow-up to your reply). So we
tried to create a gluster volume under /home using different
variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways,
transport tcp is the default, so no need to specify so remove
those two words from the CLI.
Also do you have a list of the test we should be running once
we get this volume created? Given the time-zone difference
it might help if we can run a small battery of tests and post
the results rather than test-post-new test-post... .
This is the first time I am doing performance analysis on users
as far as I remember. In our team there are separate engineers
who do these tests. Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same
tests you did on your new volume to confirm?
Pat
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a
small scale test. All we have is our production
hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute volume
inside one of those directories. After we are done, we
can remove the setup. What do you say?
Pat
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as the
bricks, I tried the dd test writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s
(faster than gluster but not as fast as I was
expecting given the 1.2 Gb/s to the no-gluster
area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for,
considering your volume is just distribute. Is
there any way you can do tests on similar hardware
but at a small scale? Just so we can run the
workload to learn more about the bottlenecks in
the system? We can probably try to get the speed
to 1.2Gb/s on your /home partition you were
telling me yesterday. Let me know if that is
something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of
expertise). I'll run your answer by some
other people who are more familiar with this.
I am also uncertain about how to
interpret the results when we also add
the dd tests writing to the /home area
(no gluster, still on the same machine)
* dd test without oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick of the
gluster area is a RAID-6 of 32 disks, I
would naively expect the writes to the
gluster area to be roughly 8x faster than
to the non-gluster.
I think a better test is to try and write to
a file using nfs without any gluster to a
location that is not inside the brick but
someother location that is on same disk(s).
If you are mounting the partition as the
brick, then we can write to a file inside
.glusterfs directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I
can't tell if fuse vs nfs is part of the
problem.
I got interested in the post because I read
that fuse speed is lesser than nfs speed
which is counter-intuitive to my
understanding. So wanted clarifications. Now
that I got my clarifications where fuse
outperformed nfs without sync, we can resume
testing as described above and try to find
what it is. Based on your email-id I am
guessing you are from Boston and I am from
Bangalore so if you are okay with doing this
debugging for multiple days because of
timezones, I will be happy to help. Please be
a bit patient with me, I am under a release
crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I
think we are collecting the profiles from an
active volume, so it has a lot of information
that is not pertaining to dd so it is
difficult to find the contributions of dd. So
I went through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own tests
on my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith Kumar
Post by Pranith Kumar Karampuri
Okay good. At least this validates my
doubts. Handling O_SYNC in gluster NFS
and fuse is a bit different.
When application opens a file with
O_SYNC on fuse mount then each write
syscall has to be written to disk as
part of the syscall where as in case of
NFS, there is no concept of open. NFS
performs write though a handle saying it
needs to be a synchronous write, so
write() syscall is performed first then
it performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync. I am
suspecting that when multiple threads do
this write+fsync() operation on the same
file, multiple writes are batched
together to be written do disk so the
throughput on the disk is increasing is
my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM, Pat
Without the oflag=sync and only a
single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied,
7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero
count=4096 bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied,
11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the speed
without oflag=sync on both the
mounts? No need to collect profiles.
On Wed, May 10, 2017 at 9:17 PM,
volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM, Pranith
Post by Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from old thread
from 2016. This is distribute
volume. Did you change any of
the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-31 01:27:26 UTC
Permalink
Pat,
What is the command you used? As per the following output, it seems
like at least one write operation took 16 seconds, which is really bad.

96.39    1165.10 us    89.00 us    *16487014.00 us*    393212    WRITE
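(For reference, the columns in the gluster volume profile info output are %-latency,
avg latency, min latency, max latency, number of calls and fop, so that row says
WRITE was called 393212 times with an average latency around 1.2 ms but a worst
case of about 16.5 seconds.)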
Post by Pat Haley
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in the
.glusterfs directory of each brick. The median results (12 dd trials in
each test) are similar to before
- gluster test volume: 586.5 MB/s
- bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/
profile_testvol_gluster.txt
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see, what the
numbers are. Please provide profile numbers for the same. From there on we
will start tuning the volume to see what we can do.
Post by Pat Haley
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted under /home.
What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive
Ben Turner's follow-up to your reply). So we tried to create a gluster
volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways,
transport tcp is the default, so no need to specify so remove those two
words from the CLI.
Post by Pat Haley
Also do you have a list of the test we should be running once we get
this volume created? Given the time-zone difference it might help if we
can run a small battery of tests and post the results rather than
test-post-new test-post... .
This is the first time I am doing performance analysis on users as far as
I remember. In our team there are separate engineers who do these tests.
Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did
on your new volume to confirm?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test.
All we have is our production hardware.
You said something about /home partition which has lesser disks, we can
create plain distribute volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd
test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your
volume is just distribute. Is there any way you can do tests on similar
hardware but at a small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let
me know if that is something you are okay to do.
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also
add the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would naively expect
the writes to the gluster area to be roughly 8x faster than to the
non-gluster.
I think a better test is to try and write a file, using nfs without
any gluster, to a location that is not inside the brick but some other
location on the same disk(s). If you are mounting the partition as the
brick, then we can write to a file inside the .glusterfs directory, something
like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
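For example (a sketch, using one of the brick mounts mentioned earlier; the
file name is arbitrary and should be removed afterwards):

dd if=/dev/zero of=/mnt/brick1/.glusterfs/dd-test-delete-me bs=1048576 count=4096
rm /mnt/brick1/.glusterfs/dd-test-delete-me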
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is
part of the problem.
I got interested in the post because I read that fuse speed is less
than nfs speed, which is counter-intuitive to my understanding, so I wanted
clarification. Now that I have that clarification (fuse outperformed
nfs without sync), we can resume testing as described above and try to find
what the problem is. Based on your email-id I am guessing you are in Boston
and I am in Bangalore, so if you are okay with this debugging running over
multiple days because of the timezones, I will be happy to help. Please be a
bit patient with me; I am under a release crunch, but I am very curious about
the problem you posted.
Was there anything useful in the profiles?
Unfortunately the profiles didn't help me much. I think we are collecting
the profiles from an active volume, so they contain a lot of information that
does not pertain to dd, and it is difficult to isolate dd's contribution. So
I went through your post again and found something I hadn't paid much
attention to earlier, i.e. oflag=sync, then did my own tests on my setup with
FUSE and sent that reply.
Post by Pat Haley
Pat
Okay, good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each
write syscall has to be written to disk as part of the syscall, whereas in
the case of NFS there is no concept of open. NFS performs the write through a
handle that says it needs to be a synchronous write, so the write() syscall
is performed first and then an fsync() is performed; a write to an fd opened
with O_SYNC effectively becomes write+fsync. My guess is that when multiple
threads do this write+fsync() operation on the same file, multiple writes get
batched together before being written to disk, so the throughput on the disk
increases.
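To get a rough feel for that difference directly on a brick, one could compare
per-write syncing with buffered writes plus a single fsync at the end (just a
sketch, not exactly what the gNFS server does internally; the target file is
arbitrary):

dd if=/dev/zero of=/mnt/brick1/sync-test bs=1M count=1024 oflag=sync   # every write is synchronous
dd if=/dev/zero of=/mnt/brick1/sync-test bs=1M count=1024 conv=fsync   # buffered writes, one fsync at the end
rm /mnt/brick1/sync-test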
Does it answer your doubts?
Post by Pat Haley
Without the oflag=sync and only a single test of each, the FUSE is going faster:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the
mounts? No need to collect profiles.
Post by Pranith Kumar Karampuri
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
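(For reference, the two diagnostics options above are the ones that
'gluster volume profile data-volume start' typically turns on; they can also
be set by hand, e.g.:

gluster volume set data-volume diagnostics.latency-measurement on
gluster volume set data-volume diagnostics.count-fop-hits on
)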
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

I copied this from an old thread from 2016. This is a distribute
volume. Did you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
Pat Haley
2017-05-31 01:40:34 UTC
Permalink
Hi Pranith,

The "dd" command was:

dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
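(A side note on dd semantics: conv=sync only pads short input blocks up to the
block size; it does not make the writes synchronous. The synchronous-write
case is the oflag=sync one discussed earlier, e.g.
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt oflag=sync)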

There were 2 instances where dd reported 22 seconds. The output from the
dd tests is in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt

Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds, which is
really bad.
96.39   1165.10 us   89.00 us   16487014.00 us   393212   WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results (12 dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the time-zone
difference it might help if we can run a small battery
of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware
for a small scale test. All we have is our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as
the bricks, I tried the dd test writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6
Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target
for, considering your volume is just
distribute. Is there any way you can do tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run your
answer by some other people who are
more familiar with this.
I am also uncertain about how to
interpret the results when we also
add the dd tests writing to the
/home area (no gluster, still on the
same machine)
* dd test without oflag=sync
(rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of
32 disks, I would naively expect the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without any
gluster to a location that is not inside
the brick but someother location that is
on same disk(s). If you are mounting the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue,
I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I
read that fuse speed is lesser than nfs
speed which is counter-intuitive to my
understanding. So wanted clarifications.
Now that I got my clarifications where
fuse outperformed nfs without sync, we
can resume testing as described above
and try to find what it is. Based on
your email-id I am guessing you are from
Boston and I am from Bangalore so if you
are okay with doing this debugging for
multiple days because of timezones, I
will be happy to help. Please be a bit
patient with me, I am under a release
crunch but I am very curious with the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help me
much, I think we are collecting the
profiles from an active volume, so it
has a lot of information that is not
pertaining to dd so it is difficult to
find the contributions of dd. So I went
through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own
tests on my setup with FUSE so sent that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file with
O_SYNC on fuse mount then each
write syscall has to be written to
disk as part of the syscall where
as in case of NFS, there is no
concept of open. NFS performs write
though a handle saying it needs to
be a synchronous write, so write()
syscall is performed first then it
performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync.
I am suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only
a single test of each, the FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync on
both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
on
nfs.exports-auth-enable: on
WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM,
Post by Pranith Kumar Karampuri
Is this the volume info
you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
c162161e-2a2d-4dac-b015-f31fd89ceb18
on />/nfs.disable: on />/nfs.export-volumes: off /
​I copied this from old
thread from 2016. This is
distribute volume. Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-05-31 01:54:34 UTC
Permalink
Thanks this is good information.

+Soumya

Soumya,
We are trying to find out why kNFS is performing way better than plain
distribute glusterfs+fuse. What information do you think would help us
compare the operations of kNFS vs gluster+fuse? We already have profile
output from fuse.
Post by Pat Haley
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Pat,
What is the command you used? As per the following output, it seems
like at least one write operation took 16 seconds. Which is really bad.
96.39 1165.10 us 89.00 us *16487014.00 us* 393212 WRITE
Post by Pat Haley
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in the
.glusterfs directory of each brick. The median results (12 dd trials in
each test) are similar to before
- gluster test volume: 586.5 MB/s
- bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see, what the
numbers are. Please provide profile numbers for the same. From there on we
will start tuning the volume to see what we can do.
Post by Pat Haley
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted under
/home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply (but I did
receive Ben Turner's follow-up to your reply). So we tried to create a
gluster volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways,
transport tcp is the default, so no need to specify so remove those two
words from the CLI.
Post by Pat Haley
Also do you have a list of the test we should be running once we get
this volume created? Given the time-zone difference it might help if we
can run a small battery of tests and post the results rather than
test-post-new test-post... .
This is the first time I am doing performance analysis on users as far
as I remember. In our team there are separate engineers who do these tests.
Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did
on your new volume to confirm?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale
test. All we have is our production hardware.
You said something about /home partition which has lesser disks, we
can create plain distribute volume inside one of those directories. After
we are done, we can remove the setup. What do you say?
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd
test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your
volume is just distribute. Is there any way you can do tests on similar
hardware but at a small scale? Just so we can run the workload to learn
more about the bottlenecks in the system? We can probably try to get the
speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let
me know if that is something you are okay to do.
Post by Pat Haley
Pat
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your
answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also
add the dd tests writing to the /home area (no gluster, still on the same
machine)
- dd test without oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount : 570 Mb/s
- gluster w/ nfs mount: 390 Mb/s
- nfs (no gluster): 1.2 Gb/s
- dd test with oflag=sync (rough average of multiple tests)
- gluster w/ fuse mount: 5 Mb/s
- gluster w/ nfs mount: 200 Mb/s
- nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would naively expect
the writes to the gluster area to be roughly 8x faster than to the
non-gluster.
I think a better test is to try and write to a file using nfs
without any gluster to a location that is not inside the brick but
someother location that is on same disk(s). If you are mounting the
partition as the brick, then we can write to a file inside .glusterfs
directory, something like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
Post by Pat Haley
I still think we have a speed issue, I can't tell if fuse vs nfs is
part of the problem.
I got interested in the post because I read that fuse speed is
lesser than nfs speed which is counter-intuitive to my understanding. So
wanted clarifications. Now that I got my clarifications where fuse
outperformed nfs without sync, we can resume testing as described above and
try to find what it is. Based on your email-id I am guessing you are from
Boston and I am from Bangalore so if you are okay with doing this debugging
for multiple days because of timezones, I will be happy to help. Please be
a bit patient with me, I am under a release crunch but I am very curious
with the problem you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are
collecting the profiles from an active volume, so it has a lot of
information that is not pertaining to dd so it is difficult to find the
contributions of dd. So I went through your post again and found something
I didn't pay much attention to earlier i.e. oflag=sync, so did my own tests
on my setup with FUSE so sent that reply.
Post by Pat Haley
Pat
Okay good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When application opens a file with O_SYNC on fuse mount then each
write syscall has to be written to disk as part of the syscall where as in
case of NFS, there is no concept of open. NFS performs write though a
handle saying it needs to be a synchronous write, so write() syscall is
performed first then it performs fsync(). so an write on an fd with O_SYNC
becomes write+fsync. I am suspecting that when multiple threads do this
write+fsync() operation on the same file, multiple writes are batched
together to be written do disk so the throughput on the disk is increasing
is my guess.
Does it answer your doubts?
Post by Pat Haley
Without the oflag=sync and only a single test of each, the FUSE is
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Could you let me know the speed without oflag=sync on both the
mounts? No need to collect profiles.
Post by Pranith Kumar Karampuri
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

I copied this from an old thread from 2016. This is a distribute
volume. Did you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
Soumya Koduri
2017-05-31 10:56:26 UTC
Permalink
Post by Pranith Kumar Karampuri
Thanks this is good information.
+Soumya
Soumya,
We are trying to find why kNFS is performing way better than
plain distribute glusterfs+fuse. What information do you think will
benefit us to compare the operations with kNFS vs gluster+fuse? We
already have profile output from fuse.
It could be because all operations done by kNFS are local to the system.
The operations done over the network by the FUSE mount could be greater in
number and more time-consuming than the ones sent by the NFS client. We could
compare and examine the pattern from a tcpdump taken over the fuse mount and
the NFS mount. Also nfsstat [1] may give some clue.
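For example (a rough sketch; file names and ports are illustrative, and the
brick/management ports actually in use can be confirmed with 'gluster volume
status'):

tcpdump -i any -s 0 -w fuse_dd.pcap port 24007 or portrange 49152-49251 &
# run the dd test on the fuse mount, then stop the capture

tcpdump -i any -s 0 -w knfs_dd.pcap port 2049 &
# run the dd test on the kNFS mount, then stop the capture

nfsstat -c   # client-side NFS operation counts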

Sorry, I hadn't followed this thread from the beginning. But is this a
comparison between a single-brick volume and kNFS exporting that brick?
Otherwise it's not a fair comparison if the volume is replicated or
distributed.

Thanks,
Soumya

[1] https://linux.die.net/man/8/nfsstat
Post by Pranith Kumar Karampuri
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from
the dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt>
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it seems like at least one write operation took 16 seconds. Which
is really bad.
96.39 1165.10 us 89.00 us *16487014.00 us* 393212 WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in the .glusterfs directory of each brick. The median results
(12 dd trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see, what the numbers are. Please provide profile numbers for
the same. From there on we will start tuning the volume to
see what we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply (but I did receive Ben Turner's follow-up to
your reply). So we tried to create a gluster volume
under /home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to
specify so remove those two words from the CLI.
Also do you have a list of the test we should be
running once we get this volume created? Given the
time-zone difference it might help if we can run a
small battery of tests and post the results rather
than test-post-new test-post... .
This is the first time I am doing performance analysis
on users as far as I remember. In our team there are
separate engineers who do these tests. Ben who replied
earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4
defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume under /home?
I don't think the bottleneck is disk. You can do
the same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware for a small scale test. All we
have is our production hardware.
You said something about /home partition which
has lesser disks, we can create plain
distribute volume inside one of those
directories. After we are done, we can remove
the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Hi Pranith,
Since we are mounting the partitions
as the bricks, I tried the dd test
writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the
1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to
target for, considering your volume is
just distribute. Is there any way you can
do tests on similar hardware but at a
small scale? Just so we can run the
workload to learn more about the
bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s
on your /home partition you were telling
me yesterday. Let me know if that is
something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your answer by some other people
who are more familiar with this.
I am also uncertain about how to
interpret the results when we
also add the dd tests writing to
the /home area (no gluster,
still on the same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570 Mb/s
390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync
(rough average of multiple
tests)
5 Mb/s
200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area
is a RAID-6 of 4 disks while
each brick of the gluster area
is a RAID-6 of 32 disks, I would
naively expect the writes to the
gluster area to be roughly 8x
faster than to the non-gluster.
I think a better test is to try and
write to a file using nfs without
any gluster to a location that is
not inside the brick but someother
location that is on same disk(s). If
you are mounting the partition as
the brick, then we can write to a
file inside .glusterfs directory,
something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue, I can't tell if fuse vs
nfs is part of the problem.
I got interested in the post because
I read that fuse speed is lesser
than nfs speed which is
counter-intuitive to my
understanding. So wanted
clarifications. Now that I got my
clarifications where fuse
outperformed nfs without sync, we
can resume testing as described
above and try to find what it is.
Based on your email-id I am guessing
you are from Boston and I am from
Bangalore so if you are okay with
doing this debugging for multiple
days because of timezones, I will be
happy to help. Please be a bit
patient with me, I am under a
release crunch but I am very curious
with the problem you posted.
Was there anything useful in
the profiles?
Unfortunately profiles didn't help
me much, I think we are collecting
the profiles from an active volume,
so it has a lot of information that
is not pertaining to dd so it is
difficult to find the contributions
of dd. So I went through your post
again and found something I didn't
pay much attention to earlier i.e.
oflag=sync, so did my own tests on
my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates my doubts. Handling
O_SYNC in gluster NFS and fuse
is a bit different.
When application opens a file
with O_SYNC on fuse mount then
each write syscall has to be
written to disk as part of the
syscall where as in case of
NFS, there is no concept of
open. NFS performs write though
a handle saying it needs to be
a synchronous write, so write()
syscall is performed first then
it performs fsync(). so an
write on an fd with O_SYNC
becomes write+fsync. I am
suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so the throughput on the disk
is increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
Without the oflag=sync and
only a single test of each,
the FUSE is going faster
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on both the mounts? No
need to collect profiles.
On Wed, May 10, 2017 at
9:17 PM, Pat Haley
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44
AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Is this the volume
info you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
c162161e-2a2d-4dac-b015-f31fd89ceb18
on />/nfs.disable: on />/nfs.export-volumes: off /
​I copied this from
old thread from 2016.
This is distribute
volume. Did you
change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
Pat Haley
2017-05-31 14:03:32 UTC
Permalink
Hi Soumya,

For the latest test we set up a test gluster volume consisting of 2
bricks, both residing on the NFS disk (/home). The gluster volume is
neither replicated nor striped. The tests were performed on the server
hosting the disk, so no network was involved.

Additional details of the system are in
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
(note that here the tests are now all being done under the /home disk)

Pat
Post by Soumya Koduri
Post by Pranith Kumar Karampuri
Thanks this is good information.
+Soumya
Soumya,
We are trying to find why kNFS is performing way better than
plain distribute glusterfs+fuse. What information do you think will
benefit us to compare the operations with kNFS vs gluster+fuse? We
already have profile output from fuse.
Could be because all operations done by kNFS are local to the system.
The operations done by FUSE mount over network could be more in number
and time-consuming than the ones sent by NFS-client. We could compare
and examine the pattern from tcpump taken over fuse-mount and
NFS-mount. Also nfsstat [1] may give some clue.
Sorry I hadn't followed this mail from the beginning. But is this
comparison between single brick volume and kNFS exporting that brick?
Otherwise its not a fair comparison if the volume is replicated or
distributed.
Thanks,
Soumya
[1] https://linux.die.net/man/8/nfsstat
Post by Pranith Kumar Karampuri
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from
the dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt>
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it seems like at least one write operation took 16 seconds. Which
is really bad.
96.39 1165.10 us 89.00 us *16487014.00 us*
393212 WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in the .glusterfs directory of each brick. The median results
(12 dd trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see, what the numbers are. Please provide profile numbers for
the same. From there on we will start tuning the volume to
see what we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply (but I did receive Ben Turner's follow-up to
your reply). So we tried to create a gluster volume
under /home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to
specify so remove those two words from the CLI.
Also do you have a list of the test we should be
running once we get this volume created? Given the
time-zone difference it might help if we can run a
small battery of tests and post the results rather
than test-post-new test-post... .
This is the first time I am doing performance analysis
on users as far as I remember. In our team there are
separate engineers who do these tests. Ben who replied
earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4
defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume under /home?
I don't think the bottleneck is disk. You can do
the same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware for a small scale test. All we
have is our production hardware.
You said something about /home partition which
has lesser disks, we can create plain
distribute volume inside one of those
directories. After we are done, we can remove
the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Hi Pranith,
Since we are mounting the partitions
as the bricks, I tried the dd test
writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the
1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to
target for, considering your volume is
just distribute. Is there any way you can
do tests on similar hardware but at a
small scale? Just so we can run the
workload to learn more about the
bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s
on your /home partition you were telling
me yesterday. Let me know if that is
something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your answer by some other people
who are more familiar with this.
I am also uncertain about how to
interpret the results when we
also add the dd tests writing to
the /home area (no gluster,
still on the same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570 Mb/s
390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync
(rough average of multiple
tests)
5 Mb/s
200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area
is a RAID-6 of 4 disks while
each brick of the gluster area
is a RAID-6 of 32 disks, I would
naively expect the writes to the
gluster area to be roughly 8x
faster than to the non-gluster.
I think a better test is to try and
write to a file using nfs without
any gluster to a location that is
not inside the brick but someother
location that is on same disk(s). If
you are mounting the partition as
the brick, then we can write to a
file inside .glusterfs directory,
something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue, I can't tell if fuse vs
nfs is part of the problem.
I got interested in the post because
I read that fuse speed is lesser
than nfs speed which is
counter-intuitive to my
understanding. So wanted
clarifications. Now that I got my
clarifications where fuse
outperformed nfs without sync, we
can resume testing as described
above and try to find what it is.
Based on your email-id I am guessing
you are from Boston and I am from
Bangalore so if you are okay with
doing this debugging for multiple
days because of timezones, I will be
happy to help. Please be a bit
patient with me, I am under a
release crunch but I am very curious
with the problem you posted.
Was there anything useful in
the profiles?
Unfortunately profiles didn't help
me much, I think we are collecting
the profiles from an active volume,
so it has a lot of information that
is not pertaining to dd so it is
difficult to find the contributions
of dd. So I went through your post
again and found something I didn't
pay much attention to earlier i.e.
oflag=sync, so did my own tests on
my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates my doubts. Handling
O_SYNC in gluster NFS and fuse
is a bit different.
When application opens a file
with O_SYNC on fuse mount then
each write syscall has to be
written to disk as part of the
syscall where as in case of
NFS, there is no concept of
open. NFS performs write though
a handle saying it needs to be
a synchronous write, so write()
syscall is performed first then
it performs fsync(). so an
write on an fd with O_SYNC
becomes write+fsync. I am
suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so the throughput on the disk
is increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
Without the oflag=sync and
only a single test of each,
the FUSE is going faster
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on both the mounts? No
need to collect profiles.
On Wed, May 10, 2017 at
9:17 PM, Pat Haley
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44
AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Is this the volume
info you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started />/Number of Bricks: 2
on />/nfs.export-volumes: off /
​I copied this from
old thread from 2016.
This is distribute
volume. Did you
change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat
Center for Ocean
Engineering Phone: (617) 253-6824
Dept. of Mechanical
Engineering Fax: (617) 253-8125
MIT, Room 5-213
http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-05-31 14:37:22 UTC
Permalink
Hi Soumya,

What pattern should we be trying to view with the tcpdump? Is a one-minute
capture of a copy operation sufficient, or are you looking for something else?

Pat
Post by Soumya Koduri
Post by Pranith Kumar Karampuri
Thanks this is good information.
+Soumya
Soumya,
We are trying to find why kNFS is performing way better than
plain distribute glusterfs+fuse. What information do you think will
benefit us to compare the operations with kNFS vs gluster+fuse? We
already have profile output from fuse.
Could be because all operations done by kNFS are local to the system.
The operations done by FUSE mount over network could be more in number
and time-consuming than the ones sent by NFS-client. We could compare
and examine the pattern from tcpump taken over fuse-mount and
NFS-mount. Also nfsstat [1] may give some clue.
Sorry I hadn't followed this mail from the beginning. But is this
comparison between single brick volume and kNFS exporting that brick?
Otherwise its not a fair comparison if the volume is replicated or
distributed.
Thanks,
Soumya
[1] https://linux.die.net/man/8/nfsstat
Post by Pranith Kumar Karampuri
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from
the dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt>
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it seems like at least one write operation took 16 seconds. Which
is really bad.
96.39 1165.10 us 89.00 us *16487014.00 us*
393212 WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in the .glusterfs directory of each brick. The median results
(12 dd trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see, what the numbers are. Please provide profile numbers for
the same. From there on we will start tuning the volume to
see what we can do.
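(For reference, one way to bracket the profile around just the dd run is
the sequence below; the volume name matches the test volume discussed
here and the mount point is a placeholder:

   gluster volume profile test-volume start
   dd if=/dev/zero count=4096 bs=1048576 of=/<test-volume-mount>/zeros.txt conv=sync
   gluster volume profile test-volume info > profile_testvol_gluster.txt
   gluster volume profile test-volume stop

That way the counters mostly reflect the dd itself rather than other
activity on the volume.)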
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply (but I did receive Ben Turner's follow-up to
your reply). So we tried to create a gluster volume
under /home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to
specify so remove those two words from the CLI.
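Concretely, either of these forms should be accepted by the CLI (same
volume name and brick paths as in the failing attempt above):

   gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
   gluster volume create test-volume transport tcp mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2

followed by "gluster volume start test-volume" before mounting it.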
Also do you have a list of the test we should be
running once we get this volume created? Given the
time-zone difference it might help if we can run a
small battery of tests and post the results rather
than test-post-new test-post... .
This is the first time I am doing performance analysis
on users as far as I remember. In our team there are
separate engineers who do these tests. Ben who replied
earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4
defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume under /home?
I don't think the bottleneck is disk. You can do
the same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware for a small scale test. All we
have is our production hardware.
You said something about /home partition which
has lesser disks, we can create plain
distribute volume inside one of those
directories. After we are done, we can remove
the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Hi Pranith,
Since we are mounting the partitions
as the bricks, I tried the dd test
writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6 Gb/s (faster than gluster but not
as fast as I was expecting given the
1.2 Gb/s to the no-gluster area w/
fewer disks).
Okay, then 1.6Gb/s is what we need to
target for, considering your volume is
just distribute. Is there any way you can
do tests on similar hardware but at a
small scale? Just so we can run the
workload to learn more about the
bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s
on your /home partition you were telling
me yesterday. Let me know if that is
something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your answer by some other people
who are more familiar with this.
I am also uncertain about how to
interpret the results when we
also add the dd tests writing to
the /home area (no gluster,
still on the same machine)
* dd test without oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount: 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area
is a RAID-6 of 4 disks while
each brick of the gluster area
is a RAID-6 of 32 disks, I would
naively expect the writes to the
gluster area to be roughly 8x
faster than to the non-gluster.
I think a better test is to try and
write to a file using nfs without
any gluster to a location that is
not inside the brick but someother
location that is on same disk(s). If
you are mounting the partition as
the brick, then we can write to a
file inside .glusterfs directory,
something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue, I can't tell if fuse vs
nfs is part of the problem.
I got interested in the post because
I read that fuse speed is lesser
than nfs speed which is
counter-intuitive to my
understanding. So wanted
clarifications. Now that I got my
clarifications where fuse
outperformed nfs without sync, we
can resume testing as described
above and try to find what it is.
Based on your email-id I am guessing
you are from Boston and I am from
Bangalore so if you are okay with
doing this debugging for multiple
days because of timezones, I will be
happy to help. Please be a bit
patient with me, I am under a
release crunch but I am very curious
with the problem you posted.
Was there anything useful in
the profiles?
Unfortunately profiles didn't help
me much, I think we are collecting
the profiles from an active volume,
so it has a lot of information that
is not pertaining to dd so it is
difficult to find the contributions
of dd. So I went through your post
again and found something I didn't
pay much attention to earlier i.e.
oflag=sync, so did my own tests on
my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates my doubts. Handling
O_SYNC in gluster NFS and fuse
is a bit different.
When application opens a file
with O_SYNC on fuse mount then
each write syscall has to be
written to disk as part of the
syscall where as in case of
NFS, there is no concept of
open. NFS performs write though
a handle saying it needs to be
a synchronous write, so write()
syscall is performed first then
it performs fsync(). so an
write on an fd with O_SYNC
becomes write+fsync. I am
suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so the throughput on the disk
is increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
Without the oflag=sync and
only a single test of each,
the FUSE is going faster
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on both the mounts? No
need to collect profiles.
On Wed, May 10, 2017 at
9:17 PM, Pat Haley
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44
AM, Pranith Kumar
Post by Pranith Kumar Karampuri
Is this the volume
info you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started />/Number of Bricks: 2
on />/nfs.export-volumes: off /
​I copied this from
old thread from 2016.
This is distribute
volume. Did you
change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ben Turner
2017-06-02 05:07:28 UTC
Permalink
Are you sure using conv=sync is what you want? I normally use conv=fdatasync; I'll look up the difference between the two and see if it affects your test.
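
Roughly speaking (the path and size below are just placeholders):

   dd if=/dev/zero of=/gdata/ddtest bs=1M count=4096 conv=fdatasync   <- write, then flush file data to disk once at the end
   dd if=/dev/zero of=/gdata/ddtest bs=1M count=4096 conv=sync        <- only pads short input blocks, does NOT force data to disk
   dd if=/dev/zero of=/gdata/ddtest bs=1M count=4096 oflag=sync       <- opens the file O_SYNC, so every write is synchronous

so conv=sync and oflag=sync measure very different things.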


-b

----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us* 393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results (12 dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the time-zone
difference it might help if we can run a small battery
of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware
for a small scale test. All we have is our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as
the bricks, I tried the dd test writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6
Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target
for, considering your volume is just
distribute. Is there any way you can do tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run your
answer by some other people who are
more familiar with this.
I am also uncertain about how to
interpret the results when we also
add the dd tests writing to the
/home area (no gluster, still on the
same machine)
* dd test without oflag=sync
(rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of
32 disks, I would naively expect the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without any
gluster to a location that is not inside
the brick but someother location that is
on same disk(s). If you are mounting the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue,
I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I
read that fuse speed is lesser than nfs
speed which is counter-intuitive to my
understanding. So wanted clarifications.
Now that I got my clarifications where
fuse outperformed nfs without sync, we
can resume testing as described above
and try to find what it is. Based on
your email-id I am guessing you are from
Boston and I am from Bangalore so if you
are okay with doing this debugging for
multiple days because of timezones, I
will be happy to help. Please be a bit
patient with me, I am under a release
crunch but I am very curious with the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help me
much, I think we are collecting the
profiles from an active volume, so it
has a lot of information that is not
pertaining to dd so it is difficult to
find the contributions of dd. So I went
through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own
tests on my setup with FUSE so sent that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file with
O_SYNC on fuse mount then each
write syscall has to be written to
disk as part of the syscall where
as in case of NFS, there is no
concept of open. NFS performs write
though a handle saying it needs to
be a synchronous write, so write()
syscall is performed first then it
performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync.
I am suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only
a single test of each, the FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync on
both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
on
nfs.exports-auth-enable: on
WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM,
Post by Pranith Kumar Karampuri
Is this the volume info
you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume info
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started />/Number
of Bricks: 2
/>/Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on />/nfs.disable: on
/>/nfs.export-volumes: off /
​I copied this from old
thread from 2016. This is
distribute volume. Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-06-12 18:35:41 UTC
Permalink
Hi Guys,

I was wondering what our next steps should be to solve the slow write times.

Recently I was debugging a large code that writes a lot of output at
every time step. When I had it write to our gluster disks, it was
taking over a day to do a single time step, whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to get gluster write times similar to the nfs ones.

Thanks

Pat
Post by Ben Turner
Are you sure using conv=sync is what you want? I normally use conv=fdatasync, I'll look up the difference between the two and see if it affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us* 393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results (12 dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the time-zone
difference it might help if we can run a small battery
of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware
for a small scale test. All we have is our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as
the bricks, I tried the dd test writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6
Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target
for, considering your volume is just
distribute. Is there any way you can do tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run your
answer by some other people who are
more familiar with this.
I am also uncertain about how to
interpret the results when we also
add the dd tests writing to the
/home area (no gluster, still on the
same machine)
* dd test without oflag=sync
(rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of
32 disks, I would naively expect the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without any
gluster to a location that is not inside
the brick but someother location that is
on same disk(s). If you are mounting the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue,
I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I
read that fuse speed is lesser than nfs
speed which is counter-intuitive to my
understanding. So wanted clarifications.
Now that I got my clarifications where
fuse outperformed nfs without sync, we
can resume testing as described above
and try to find what it is. Based on
your email-id I am guessing you are from
Boston and I am from Bangalore so if you
are okay with doing this debugging for
multiple days because of timezones, I
will be happy to help. Please be a bit
patient with me, I am under a release
crunch but I am very curious with the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help me
much, I think we are collecting the
profiles from an active volume, so it
has a lot of information that is not
pertaining to dd so it is difficult to
find the contributions of dd. So I went
through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own
tests on my setup with FUSE so sent that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file with
O_SYNC on fuse mount then each
write syscall has to be written to
disk as part of the syscall where
as in case of NFS, there is no
concept of open. NFS performs write
though a handle saying it needs to
be a synchronous write, so write()
syscall is performed first then it
performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync.
I am suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only
a single test of each, the FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync on
both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
on
nfs.exports-auth-enable: on
WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM,
Post by Pranith Kumar Karampuri
Is this the volume info
you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume info
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started />/Number
of Bricks: 2
/>/Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on />/nfs.disable: on
/>/nfs.export-volumes: off /
​I copied this from old
thread from 2016. This is
distribute volume. Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Ben Turner
2017-06-12 20:28:09 UTC
Permalink
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
I can see in your test:

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt

You averaged ~600 MB / sec (expected for replica 2 with 10G: {~1200 MB / sec} / #replicas{2} = 600). Gluster does client-side replication, so with replica 2 you will only ever see 1/2 the speed of the slowest part of your stack (NW, disk, RAM, CPU). This is usually NW or disk, and 600 is normally a best case. Now in your output I do see the instances where you went down to 200 MB / sec. I can only explain this in three ways:

1. You are not using conv=fdatasync and writes are actually going to page cache and then being flushed to disk. During the fsync the memory is not yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNs (like in a SAN) and when write times are slow the RAID group is busy servicing other LUNs.
3. Gluster bug / config issue / some other unknown unknown.

So I see 2 issues here:

1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.

WRT #1 - have a look at my estimates above. My formula for guesstimating gluster perf is: throughput = NIC throughput or storage (whichever is slower) / # replicas * overhead (figure .7 or .8). Also, the larger the record size the better for glusterfs mounts; I normally like to be at LEAST 64k, up to 1024k:

# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
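
To put rough numbers on that formula (assuming a 10G NIC is the slower leg):

   replica 2:        1200 MB/sec / 2 * .7 = ~420 MB/sec best case from one client
   plain distribute: 1200 MB/sec / 1 * .7 = ~840 MB/sec best case from one client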

WRT #2 - Again, I question your testing and your storage config. Try using conv=fdatasync for your DDs, use a larger record size, and make sure that your back end storage is not causing your slowdowns. Also remember that with replica 2 you will take ~50% hit on writes because the client uses 50% of its bandwidth to write to one replica and 50% to the other.

-b
Thanks
Pat
Post by Ben Turner
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us* 393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results (12 dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
Post by Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the time-zone
difference it might help if we can run a small battery
of tests and post the results rather than test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware
for a small scale test. All we have is our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as
the bricks, I tried the dd test writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were 1.6
Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target
for, considering your volume is just
distribute. Is there any way you can do tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM, Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run your
answer by some other people who are
more familiar with this.
I am also uncertain about how to
interpret the results when we also
add the dd tests writing to the
/home area (no gluster, still on the
same machine)
* dd test without oflag=sync
(rough average of multiple tests)
o gluster w/ fuse mount : 570
Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough
average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a
RAID-6 of 4 disks while each brick
of the gluster area is a RAID-6 of
32 disks, I would naively expect the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without any
gluster to a location that is not inside
the brick but someother location that is
on same disk(s). If you are mounting the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue,
I can't tell if fuse vs nfs is part
of the problem.
I got interested in the post because I
read that fuse speed is lesser than nfs
speed which is counter-intuitive to my
understanding. So wanted clarifications.
Now that I got my clarifications where
fuse outperformed nfs without sync, we
can resume testing as described above
and try to find what it is. Based on
your email-id I am guessing you are from
Boston and I am from Bangalore so if you
are okay with doing this debugging for
multiple days because of timezones, I
will be happy to help. Please be a bit
patient with me, I am under a release
crunch but I am very curious with the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help me
much, I think we are collecting the
profiles from an active volume, so it
has a lot of information that is not
pertaining to dd so it is difficult to
find the contributions of dd. So I went
through your post again and found
something I didn't pay much attention to
earlier i.e. oflag=sync, so did my own
tests on my setup with FUSE so sent that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file with
O_SYNC on fuse mount then each
write syscall has to be written to
disk as part of the syscall where
as in case of NFS, there is no
concept of open. NFS performs write
though a handle saying it needs to
be a synchronous write, so write()
syscall is performed first then it
performs fsync(). so an write on an
fd with O_SYNC becomes write+fsync.
I am suspecting that when multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only
a single test of each, the FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync on
both the mounts? No need to
collect profiles.
On Wed, May 10, 2017 at 9:17
gluster volume info
Volume Name: data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
nfs.exports-auth-enable: on
WARNING
on
nfs.disable: on
nfs.export-volumes: off
On 05/10/2017 11:44 AM,
Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Is this the volume info
you have?
/[root at mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume info
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started
/>/Number
of Bricks: 2
/>/Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on />/nfs.disable: on
/>/nfs.export-volumes: off
/
​I copied this from old
thread from 2016. This is
distribute volume. Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-06-20 16:06:30 UTC
Permalink
Hi Ben,

Sorry this took so long, but we had a real-time forecasting exercise
last week and I could only get to this now.

Backend Hardware/OS:

* Much of the information on our back end system is included at the
top of
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
* The specific model of the hard disks is Seagate Enterprise Capacity
V.4 6TB (ST6000NM0024). The rated interface speed is 6 Gb/s.
* Note: there is one physical server that hosts both the NFS and the
GlusterFS areas

Latest tests

I have had time to run one of the dd tests you requested against the
underlying XFS FS. The median rate was 170 MB/s. The dd results
and iostat record are in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/

I'll add tests for the other brick and to the NFS area later.

Thanks

Pat
throughput = slowest of (disks, NIC) * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second, can you refresh me on your workload? Are you doing reads / writes or both? If both, what mix? Since we are using DD I assume you are working with large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Let's see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
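A sketch of one iteration, run once per brick and once on the gluster mount (paths are placeholders):

   iostat -c -m -x 1 > iostat-$(hostname).txt &
   dd if=/dev/zero of=/mnt/brick1/ddtest bs=1024k count=10000 conv=fdatasync
   echo 3 > /proc/sys/vm/drop_caches
   dd if=/mnt/brick1/ddtest of=/dev/null bs=1024k count=10000
   kill %1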
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
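(For what it's worth, the volume reports itself as plain distribute:

   gluster volume info data-volume | grep -E "^Type|Number of Bricks"
   Type: Distribute
   Number of Bricks: 2

so there should be no replication overhead on writes.)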
Thanks
Pat
Post by Ben Turner
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec(expected for replica 2 with 10G, {~1200 MB /
sec} / #replicas{2} = 600). Gluster does client side replication so with
replica 2 you will only ever see 1/2 the speed of your slowest part of
the
stack(NW, disk, RAM, CPU). This is usually NW or disk and 600 is
normally
a best case. Now in your output I do see the instances where you went
1. You are not using conv=fdatasync and writes are actually going to
page
cache and then being flushed to disk. During the fsync the memory is not
yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNS(like in a SAN)
and when write times are slow the RAID group is busy serviceing other
LUNs.
3. Gluster bug / config issue / some other unknown unknown.
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating
gluster perf is: throughput = NIC throughput or storage(whatever is
slower) / # replicas * overhead(figure .7 or .8). Also the larger the
record size the better for glusterfs mounts, I normally like to be at
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
WRT #2 - Again, I question your testing and your storage config. Try
using
conv=fdatasync for your DDs, use a larger record size, and make sure that
your back end storage is not causing your slowdowns. Also remember that
with replica 2 you will take ~50% hit on writes because the client uses
50% of its bandwidth to write to one replica and 50% to the other.
-b
Thanks
Pat
Post by Ben Turner
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us*
393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in
the .glusterfs directory of each brick. The median results
(12
dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see
what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never received your
reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume
under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so there is no need to
specify it; just remove those two words from the CLI.
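(As a sketch, the corrected command would look something like this, using the
same brick paths as in the attempt above:)

gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
gluster volume start test-volume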
Also do you have a list of the test we should be
running
once we get this volume created? Given the
time-zone
difference it might help if we can run a small
battery
of tests and post the results rather than
test-post-new
test-post... .
This is the first time I am doing performance analysis
on
users as far as I remember. In our team there are
separate
engineers who do these tests. Ben who replied earlier is
one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On 05/11/2017 12:06 PM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume
under /home?
I don't think the bottleneck is disk. You can do
the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware
for a small scale test. All we have is
our
production hardware.
You said something about the /home partition which has
fewer disks; we can create a plain distribute
volume inside one of those directories. After
we
are done, we can remove the setup. What do you
say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Haley
Hi Pranith,
Since we are mounting the partitions
as
the bricks, I tried the dd test
writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6
Gb/s (faster than gluster but not as
fast
as I was expecting given the 1.2 Gb/s
to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to
target
for, considering your volume is just
distribute. Is there any way you can do
tests
on similar hardware but at a small scale?
Just so we can run the workload to learn
more
about the bottlenecks in the system? We
can
probably try to get the speed to 1.2Gb/s
on
your /home partition you were telling me
yesterday. Let me know if that is
something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your
answer by some other people who
are
more familiar with this.
I am also uncertain about how to
interpret the results when we
also
add the dd tests writing to the
/home area (no gluster, still on
the
same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570
Mb/s
390
Mb/s
o nfs (no gluster): 1.2
Gb/s
* dd test with oflag=sync
(rough
average of multiple tests)
5
Mb/s
200
Mb/s
o nfs (no gluster): 20
Mb/s
Given that the non-gluster area
is
a
RAID-6 of 4 disks while each
brick
of the gluster area is a RAID-6
of
32 disks, I would naively expect
the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without
any
gluster to a location that is not
inside
the brick but some other location that is
on the same disk(s). If you are mounting
the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
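(A rough sketch of that test -- the brick path is one of ours, the file name is
a placeholder, conv=fdatasync is added here so the rate reflects disk rather
than page cache, and the file is removed afterwards:)

dd if=/dev/zero of=/mnt/brick1/.glusterfs/ddtest.tmp bs=1M count=4096 conv=fdatasync  # raw disk speed, no gluster in the path
rm -f /mnt/brick1/.glusterfs/ddtest.tmp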
I still think we have a speed
issue,
I can't tell if fuse vs nfs is
part
of the problem.
I got interested in the post because
I
read that fuse speed is lesser than
nfs
speed which is counter-intuitive to
my
understanding. So wanted
clarifications.
Now that I got my clarifications
where
fuse outperformed nfs without sync,
we
can resume testing as described
above
and try to find what it is. Based on
your email-id I am guessing you are
from
Boston and I am from Bangalore so if
you
are okay with doing this debugging
for
multiple days because of timezones,
I
will be happy to help. Please be a
bit
patient with me, I am under a
release
crunch but I am very curious with
the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help
me
much, I think we are collecting the
profiles from an active volume, so
it
has a lot of information that is not
pertaining to dd so it is difficult
to
find the contributions of dd. So I
went
through your post again and found
something I didn't pay much
attention
to
earlier i.e. oflag=sync, so did my
own
tests on my setup with FUSE so sent
that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file
with
O_SYNC on fuse mount then each
write syscall has to be written
to
disk as part of the syscall,
whereas in the case of NFS, there is no
concept of open. NFS performs
write
though a handle saying it needs
to
be a synchronous write, so
write()
syscall is performed first then
it
performs fsync(), so a write on an
fd with O_SYNC becomes
write+fsync.
I am suspecting that when
multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written to disk
so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
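(One way to see the user-visible difference, as a sketch -- the mount point is a
placeholder:)

dd if=/dev/zero of=/fuse-mount/synctest bs=1M count=256 oflag=sync       # O_SYNC: every write must reach disk before returning
dd if=/dev/zero of=/fuse-mount/synctest bs=1M count=256 conv=fdatasync   # one flush at the end, after all the writes
rm -f /fuse-mount/synctest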
On Wed, May 10, 2017 at 9:35
PM,
Without the oflag=sync and
only
a single test of each, the
FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on
both the mounts? No need
to
collect profiles.
On Wed, May 10, 2017 at
9:17
PM, Pat Haley
Here is what I see
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
off
On 05/10/2017 11:44
AM,
Pranith Kumar
Karampuri
Post by Pranith Kumar Karampuri
Is this the volume
info
you have?
/[root at
mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
info
Distribute />/Volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started
/>/Number
of Bricks: 2
tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
/>/Options
on />/nfs.disable: on
off
/
​I copied this from
old
thread from 2016.
This
is
distribute volume.
Did
you change any of the
options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-06-22 20:53:42 UTC
Permalink
Hi,

Today we experimented with some of the FUSE options that we found in the
list.

Changing these options had no effect:

gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB

Changing the following option from its default value made the speed slower

gluster volume set test-volume performance.write-behind off (on by default)

Changing the following options initially appeared to give a 10% increase
in speed, but this vanished in subsequent tests (we think the apparent
increase may have been due to a lighter workload on the computer from other
users)

gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4

Can anything be gleaned from these observations? Are there other things
we can try?
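(For the record, a sketch of how these experiments can be rolled back to the
defaults, using the same volume name as above:)

gluster volume reset test-volume performance.cache-max-file-size
gluster volume reset test-volume performance.cache-refresh-timeout
gluster volume reset test-volume performance.cache-size
gluster volume reset test-volume performance.write-behind-window-size
gluster volume reset test-volume performance.write-behind
gluster volume reset test-volume performance.stat-prefetch
gluster volume reset test-volume client.event-threads
gluster volume reset test-volume server.event-threads
gluster volume info test-volume

The last command lists whatever options are still reconfigured away from
their defaults.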

Thanks

Pat
Post by Pat Haley
Hi Ben,
Sorry this took so long, but we had a real-time forecasting exercise
last week and I could only get to this now.
* Much of the information on our back end system is included at the
top of
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
* The specific model of the hard disks is SeaGate ENTERPRISE
CAPACITY V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
* Note: there is one physical server that hosts both the NFS and the
GlusterFS areas
Latest tests
I have had time to run the tests for one of the dd tests you requested
to the underlying XFS FS. The median rate was 170 MB/s. The dd
results and iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and to the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both what mix? Since we are using DD I assume you are working with large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Let's see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
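(Pulling those steps together as a small sketch, run as root on the brick
server -- the mount point and file name are placeholders:)

iostat -c -m -x 1 > iostat-$(hostname).txt &                               # capture disk/CPU stats in the background
dd if=/dev/zero of=/xfs-mount/ddfile bs=1024k count=10000 conv=fdatasync   # sequential write to the back end
echo 3 > /proc/sys/vm/drop_caches                                          # drop the page cache before reading
dd if=/xfs-mount/ddfile of=/dev/null bs=1024k count=10000                  # sequential read back
kill $!                                                                    # stop the iostat capture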
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
Thanks
Pat
Post by Ben Turner
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec(expected for replica 2 with 10G, {~1200 MB /
sec} / #replicas{2} = 600). Gluster does client side replication so with
replica 2 you will only ever see 1/2 the speed of your slowest part of
the
stack(NW, disk, RAM, CPU). This is usually NW or disk and 600 is
normally
a best case. Now in your output I do see the instances where you went
1. You are not using conv=fdatasync and writes are actually going to
page
cache and then being flushed to disk. During the fsync the memory is not
yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNS(like in a SAN)
and when write times are slow the RAID group is busy serviceing other
LUNs.
3. Gluster bug / config issue / some other unknown unknown.
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating
gluster perf is: throughput = NIC throughput or storage(whatever is
slower) / # replicas * overhead(figure .7 or .8). Also the larger the
record size the better for glusterfs mounts, I normally like to be at
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
WRT #2 - Again, I question your testing and your storage config. Try
using
conv=fdatasync for your DDs, use a larger record size, and make sure that
your back end storage is not causing your slowdowns. Also remember that
with replica 2 you will take ~50% hit on writes because the client uses
50% of its bandwidth to write to one replica and 50% to the other.
-b
Thanks
Pat
Post by Ben Turner
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Post by Pranith Kumar Karampuri
Pat,
What is the command you used? As per the following output,
it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us*
393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and
in
the .glusterfs directory of each brick. The median results
(12
dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Post by Pranith Kumar Karampuri
Let's start with the same 'dd' test we were testing with to
see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see
what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume
under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to
specify
so remove those two words from the CLI.
Also do you have a list of the test we should be
running
once we get this volume created? Given the
time-zone
difference it might help if we can run a small
battery
of tests and post the results rather than
test-post-new
test-post... .
This is the first time I am doing performance analysis
on
users as far as I remember. In our team there are
separate
engineers who do these tests. Ben who replied earlier is
one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On 05/11/2017 12:06 PM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume
under /home?
I don't think the bottleneck is disk. You can do
the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware
for a small scale test. All we have is
our
production hardware.
You said something about /home partition which
has
lesser disks, we can create plain distribute
volume inside one of those directories. After
we
are done, we can remove the setup. What do you
say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Haley
Hi Pranith,
Since we are mounting the partitions
as
the bricks, I tried the dd test
writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6
Gb/s (faster than gluster but not as
fast
as I was expecting given the 1.2 Gb/s
to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to
target
for, considering your volume is just
distribute. Is there any way you can do
tests
on similar hardware but at a small scale?
Just so we can run the workload to learn
more
about the bottlenecks in the system? We
can
probably try to get the speed to 1.2Gb/s
on
your /home partition you were telling me
yesterday. Let me know if that is
something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your
answer by some other people who
are
more familiar with this.
I am also uncertain about how to
interpret the results when we
also
add the dd tests writing to the
/home area (no gluster, still on
the
same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570
Mb/s
390
Mb/s
o nfs (no gluster): 1.2
Gb/s
* dd test with oflag=sync
(rough
average of multiple tests)
5
Mb/s
200
Mb/s
o nfs (no gluster): 20
Mb/s
Given that the non-gluster area
is
a
RAID-6 of 4 disks while each
brick
of the gluster area is a RAID-6
of
32 disks, I would naively expect
the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without
any
gluster to a location that is not
inside
the brick but someother location
that
is
on same disk(s). If you are mounting
the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue,
I can't tell if fuse vs nfs is
part
of the problem.
I got interested in the post because
I
read that fuse speed is lesser than
nfs
speed which is counter-intuitive to
my
understanding. So wanted
clarifications.
Now that I got my clarifications
where
fuse outperformed nfs without sync,
we
can resume testing as described
above
and try to find what it is. Based on
your email-id I am guessing you are
from
Boston and I am from Bangalore so if
you
are okay with doing this debugging
for
multiple days because of timezones,
I
will be happy to help. Please be a
bit
patient with me, I am under a
release
crunch but I am very curious with
the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help
me
much, I think we are collecting the
profiles from an active volume, so
it
has a lot of information that is not
pertaining to dd so it is difficult
to
find the contributions of dd. So I
went
through your post again and found
something I didn't pay much
attention
to
earlier i.e. oflag=sync, so did my
own
tests on my setup with FUSE so sent
that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Post by Pranith Kumar Karampuri
Okay good. At least this
validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file
with
O_SYNC on fuse mount then each
write syscall has to be written
to
disk as part of the syscall
where
as in case of NFS, there is no
concept of open. NFS performs
write
though a handle saying it needs
to
be a synchronous write, so
write()
syscall is performed first then
it
performs fsync(). so an write
on
an
fd with O_SYNC becomes
write+fsync.
I am suspecting that when
multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
PM,
Without the oflag=sync and
only
a single test of each, the
FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Pranith
Post by Pranith Kumar Karampuri
Could you let me know the
speed without oflag=sync
on
both the mounts? No need
to
collect profiles.
On Wed, May 10, 2017 at
9:17
PM, Pat Haley
Here is what I see
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
off
On 05/10/2017 11:44
AM,
Pranith Kumar
Karampuri
Post by Pranith Kumar Karampuri
Is this the volume
info
you have?
/[root at
mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
info
Distribute />/Volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started
/>/Number
of Bricks: 2
tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
/>/Options
on />/nfs.disable: on
off
/
​I copied this from
old
thread from 2016.
This
is
distribute volume.
Did
you change any of the
options in between?
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-06-23 03:40:49 UTC
Permalink
Post by Pat Haley
Hi,
Today we experimented with some of the FUSE options that we found in the
list.
gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB
This is a good coincidence, I am meeting with write-behind
maintainer (+Raghavendra G) today for the same doubt. I think we will have
something by EOD IST. I will update you.
Post by Pat Haley
Changing the following option from its default value made the speed slower
gluster volume set test-volume performance.write-behind off (on by default)
Changing the following options initially appeared to give a 10% increase
in speed, but this vanished in subsequent tests (we think the apparent
increase may have been to a lighter workload on the computer from other
users)
gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4
Can anything be gleaned from these observations? Are there other things
we can try?
Thanks
Pat
Hi Ben,
Sorry this took so long, but we had a real-time forecasting exercise last
week and I could only get to this now.
- Much of the information on our back end system is included at the
top of http://lists.gluster.org/pipermail/gluster-users/2017-
April/030529.html
- The specific model of the hard disks is SeaGate ENTERPRISE CAPACITY
V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
- Note: there is one physical server that hosts both the NFS and the
GlusterFS areas
Latest tests
I have had time to run the tests for one of the dd tests you requested to
the underlying XFS FS. The median rate was 170 MB/s. The dd results and
iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and to the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both what mix? Since we are using DD I assume you are working iwth large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Lets see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
Thanks
Pat
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec(expected for replica 2 with 10G, {~1200 MB /
sec} / #replicas{2} = 600). Gluster does client side replication so with
replica 2 you will only ever see 1/2 the speed of your slowest part of
the
stack(NW, disk, RAM, CPU). This is usually NW or disk and 600 is
normally
a best case. Now in your output I do see the instances where you went
1. You are not using conv=fdatasync and writes are actually going to
page
cache and then being flushed to disk. During the fsync the memory is not
yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNS(like in a SAN)
and when write times are slow the RAID group is busy serviceing other
LUNs.
3. Gluster bug / config issue / some other unknown unknown.
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating
gluster perf is: throughput = NIC throughput or storage(whatever is
slower) / # replicas * overhead(figure .7 or .8). Also the larger the
record size the better for glusterfs mounts, I normally like to be at
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000
conv=fdatasync
WRT #2 - Again, I question your testing and your storage config. Try
using
conv=fdatasync for your DDs, use a larger record size, and make sure that
your back end storage is not causing your slowdowns. Also remember that
with replica 2 you will take ~50% hit on writes because the client uses
50% of its bandwidth to write to one replica and 50% to the other.
-b
Thanks
Pat
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us*
393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results
(12
dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt> <http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume
mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never saw received your
reply
(but I did receive Ben Turner's follow-up to your
reply). So we tried to create a gluster volume
under
/home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think.
Anyways, transport tcp is the default, so no need to specify
so remove those two words from the CLI.
Also do you have a list of the test we should be running
once we get this volume created? Given the
time-zone
difference it might help if we can run a small
battery
of tests and post the results rather than
test-post-new
test-post... .
This is the first time I am doing performance analysis on
users as far as I remember. In our team there are
separate
engineers who do these tests. Ben who replied earlier is one
such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On 05/11/2017 12:06 PM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted ax xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a
volume
under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar
hardware
for a small scale test. All we have is
our
production hardware.
You said something about /home partition which has
lesser disks, we can create plain distribute
volume inside one of those directories. After we
are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar
On Thu, May 11, 2017 at 2:48 AM, Pat
Haley
Hi Pranith,
Since we are mounting the partitions
as
the bricks, I tried the dd test
writing
to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
The results without oflag=sync were
1.6
Gb/s (faster than gluster but not as
fast
as I was expecting given the 1.2 Gb/s
to
the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to
target
for, considering your volume is just
distribute. Is there any way you can do
tests
on similar hardware but at a small scale?
Just so we can run the workload to learn more
about the bottlenecks in the system? We
can
probably try to get the speed to 1.2Gb/s on
your /home partition you were telling me
yesterday. Let me know if that is
something
you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar
On Wed, May 10, 2017 at 10:15 PM,
Pat
Hi Pranith,
Not entirely sure (this isn't my
area of expertise). I'll run
your
answer by some other people who
are
more familiar with this.
I am also uncertain about how to
interpret the results when we
also
add the dd tests writing to the
/home area (no gluster, still on
the
same machine)
* dd test without oflag=sync
(rough average of multiple
tests)
570
Mb/s
390
Mb/s
o nfs (no gluster): 1.2
Gb/s
* dd test with oflag=sync
(rough
average of multiple tests)
5
Mb/s
200
Mb/s
o nfs (no gluster): 20
Mb/s
Given that the non-gluster area
is
a
RAID-6 of 4 disks while each
brick
of the gluster area is a RAID-6
of
32 disks, I would naively expect
the
writes to the gluster area to be
roughly 8x faster than to the
non-gluster.
I think a better test is to try and
write to a file using nfs without
any
gluster to a location that is not
inside
the brick but someother location
that
is
on same disk(s). If you are mounting
the
partition as the brick, then we can
write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed
issue,
I can't tell if fuse vs nfs is
part
of the problem.
I got interested in the post because
I
read that fuse speed is lesser than
nfs
speed which is counter-intuitive to
my
understanding. So wanted
clarifications.
Now that I got my clarifications
where
fuse outperformed nfs without sync,
we
can resume testing as described
above
and try to find what it is. Based on
your email-id I am guessing you are
from
Boston and I am from Bangalore so if
you
are okay with doing this debugging
for
multiple days because of timezones,
I
will be happy to help. Please be a
bit
patient with me, I am under a
release
crunch but I am very curious with
the
problem you posted.
Was there anything useful in the
profiles?
Unfortunately profiles didn't help
me
much, I think we are collecting the
profiles from an active volume, so
it
has a lot of information that is not
pertaining to dd so it is difficult
to
find the contributions of dd. So I
went
through your post again and found
something I didn't pay much
attention
to
earlier i.e. oflag=sync, so did my
own
tests on my setup with FUSE so sent
that
reply.
Pat
On 05/10/2017 12:15 PM, Pranith
Okay good. At least this
validates
my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit
different.
When application opens a file
with
O_SYNC on fuse mount then each
write syscall has to be written
to
disk as part of the syscall
where
as in case of NFS, there is no
concept of open. NFS performs
write
though a handle saying it needs
to
be a synchronous write, so
write()
syscall is performed first then
it
performs fsync(). so an write
on
an
fd with O_SYNC becomes
write+fsync.
I am suspecting that when
multiple
threads do this write+fsync()
operation on the same file,
multiple writes are batched
together to be written do disk
so
the throughput on the disk is
increasing is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35
PM,
Without the oflag=sync and
only
a single test of each, the
FUSE
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM,
Pranith
Could you let me know the
speed without oflag=sync
on
both the mounts? No need
to
collect profiles.
On Wed, May 10, 2017 at
9:17
PM, Pat Haley
Here is what I see
gluster volume info
data-volume
Type: Distribute
c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
on
on
on
WARNING
on
nfs.disable: on
off
On 05/10/2017 11:44
AM,
Pranith Kumar
Karampuri
Is this the volume
info
you have?
/[root at
mseas-data2
<http://www.gluster.org/mailman/listinfo/gluster-users> <http://www.gluster.org/mailman/listinfo/gluster-users>
~]# gluster volume
info
Distribute />/Volume
c162161e-2a2d-4dac-b015-f31fd89ceb18
/>/Status: Started
/>/Number
of Bricks: 2
tcp
mseas-data2:/mnt/brick1
mseas-data2:/mnt/brick2
/>/Options
on />/nfs.disable: on
off
/
​I copied this from
old
thread from 2016.
This
is
distribute volume.
Did
you change any of the
options in between?
--
Pranith
Pranith Kumar Karampuri
2017-06-24 05:43:53 UTC
Permalink
On Fri, Jun 23, 2017 at 9:10 AM, Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Pat Haley
Hi,
Today we experimented with some of the FUSE options that we found in the
list.
gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB
This is a good coincidence, I am meeting with write-behind
maintainer(+Raghavendra G) today for the same doubt. I think we will have
something by EOD IST. I will update you.
Sorry, I forgot to update you. It seems like there is a bug in write-behind
and the Facebook guys sent a patch (http://review.gluster.org/16079) to fix it.
But even with that I am not seeing any improvement. Maybe I am doing
something wrong. Will update you if I find anything more.
Post by Pranith Kumar Karampuri
Changing the following option from its default value made the speed slower
Post by Pat Haley
gluster volume set test-volume performance.write-behind off (on by default)
Changing the following options initially appeared to give a 10% increase
in speed, but this vanished in subsequent tests (we think the apparent
increase may have been to a lighter workload on the computer from other
users)
gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4
Can anything be gleaned from these observations? Are there other things
we can try?
Thanks
Pat
Hi Ben,
Sorry this took so long, but we had a real-time forecasting exercise last
week and I could only get to this now.
- Much of the information on our back end system is included at the
top of http://lists.gluster.org/pipermail/gluster-users/2017-April/
030529.html
- The specific model of the hard disks is SeaGate ENTERPRISE CAPACITY
V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
- Note: there is one physical server that hosts both the NFS and the
GlusterFS areas
Latest tests
I have had time to run the tests for one of the dd tests you requested to
the underlying XFS FS. The median rate was 170 MB/s. The dd results and
iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and to the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both what mix? Since we are using DD I assume you are working iwth large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Lets see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
Thanks
Pat
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times to either gluster of nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec(expected for replica 2 with 10G, {~1200 MB /
sec} / #replicas{2} = 600). Gluster does client side replication so with
replica 2 you will only ever see 1/2 the speed of your slowest part of
the
stack(NW, disk, RAM, CPU). This is usually NW or disk and 600 is
normally
a best case. Now in your output I do see the instances where you went
1. You are not using conv=fdatasync and writes are actually going to
page
cache and then being flushed to disk. During the fsync the memory is not
yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNS(like in a SAN)
and when write times are slow the RAID group is busy serviceing other
LUNs.
3. Gluster bug / config issue / some other unknown unknown.
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating
gluster perf is: throughput = NIC throughput or storage(whatever is
slower) / # replicas * overhead(figure .7 or .8). Also the larger the
record size the better for glusterfs mounts, I normally like to be at
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000
conv=fdatasync
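Plugging numbers into that formula (assuming the 10G link from the estimate above, i.e. roughly 1200 MB/s of line rate):

# replica 2 over a 10G link, 0.7 overhead factor
echo '1200 / 2 * 0.7' | bc -l     # ~420 MB/s best case per client
# plain distribute (replica count 1), same link
echo '1200 / 1 * 0.7' | bc -l     # ~840 MB/s best case per client

Since data-volume here is a plain distribute volume (no replicas), the second figure is the relevant ceiling.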
WRT #2 - Again, I question your testing and your storage config. Try
using
conv=fdatasync for your DDs, use a larger record size, and make sure that
your back end storage is not causing your slowdowns. Also remember that
with replica 2 you will take ~50% hit on writes because the client uses
50% of its bandwidth to write to one replica and 50% to the other.
-b
Thanks
Pat
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
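Since conv=sync, conv=fdatasync and oflag=sync keep coming up, a quick side-by-side of what each dd flag actually does (my understanding of the coreutils semantics; /gluster-mount is a placeholder path):

# conv=sync: pads each input block with NULs up to the block size;
#            it does NOT force anything to disk, so cached throughput is reported
dd if=/dev/zero of=/gluster-mount/zeros.txt bs=1M count=4096 conv=sync

# conv=fdatasync: writes normally, then issues a single fdatasync() before dd
#                 exits, so the reported rate includes flushing the page cache
dd if=/dev/zero of=/gluster-mount/zeros.txt bs=1M count=4096 conv=fdatasync

# oflag=sync: opens the output O_SYNC, so every single write must reach stable
#             storage before the next one starts (the worst case for a FUSE mount)
dd if=/dev/zero of=/gluster-mount/zeros.txt bs=1M count=4096 oflag=sync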
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
%-latency  Avg-latency  Min-latency  Max-latency       No. of calls  Fop
96.39      1165.10 us   89.00 us     16487014.00 us    393212        WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in the .glusterfs directory of each brick. The median results (12 dd trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see what the numbers are. Please provide profile numbers for the same. From there on we will start tuning the volume to see what we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive Ben Turner's follow-up to it). So we tried to create a gluster volume under /home using different variations of
gluster volume create test-volume
mseas-data2:/home/gbrick_test_1
mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning, I think. Anyways, transport tcp is the default, so there is no need to specify it; just remove those two words from the CLI.
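In other words, something along these lines should get past the parser (a sketch using the same brick paths; gluster may still ask for 'force' if it objects to the brick location):

# transport keyword moved in front of the brick list ...
gluster volume create test-volume transport tcp \
    mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
# ... or dropped entirely, since tcp is the default
gluster volume create test-volume \
    mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
gluster volume start test-volume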
Also do you have a list of the tests we should be running once we get this volume created? Given the time-zone difference it might help if we can run a small battery of tests and post the results rather than test-post-new test-post... .
This is the first time I am doing performance analysis on users as far as I remember. In our team there are separate engineers who do these tests. Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the
same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test. All we have is our production hardware.
You said something about the /home partition which has fewer disks; we can create a plain distribute volume inside one of those directories. After we are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The results without oflag=sync were 1.6 Gb/s (faster than gluster but not as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your volume is just distribute. Is there any way you can do tests on similar hardware but at a small scale? Just so we can run the workload to learn more about the bottlenecks in the system? We can probably try to get the speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let me know if that is something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar Karampuri
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also add the dd tests writing to the /home area (no gluster, still on the same machine)
* dd test without oflag=sync (rough average of multiple tests): 570 Mb/s, 390 Mb/s, nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests): 5 Mb/s, 200 Mb/s, nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each brick of the gluster area is a RAID-6 of 32 disks, I would naively expect the writes to the gluster area to be roughly 8x faster than to the non-gluster.
I think a better test is to try and write to a file using nfs without any gluster to a location that is not inside the brick but some other location that is on the same disk(s). If you are mounting the partition as the brick, then we can write to a file inside the .glusterfs directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
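Concretely, on this setup that test might look like the following (a sketch; the temporary file name is arbitrary and should be removed afterwards):

# write straight to the brick filesystem, bypassing gluster entirely
dd if=/dev/zero of=/mnt/brick1/.glusterfs/dd-test-delete-me bs=1M count=4096 conv=fdatasync
rm -f /mnt/brick1/.glusterfs/dd-test-delete-me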
I still think we have a speed issue, I can't tell if fuse vs nfs is part of the problem.
I got interested in the post because I read that fuse speed is lesser than nfs speed which is counter-intuitive to my understanding. So wanted clarifications. Now that I got my clarifications where fuse outperformed nfs without sync, we can resume testing as described above and try to find what it is. Based on your email-id I am guessing you are from Boston and I am from Bangalore so if you are okay with doing this debugging for multiple days because of timezones, I will be happy to help. Please be a bit patient with me, I am under a release crunch but I am very curious with the problem you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting the profiles from an active volume, so it has a lot of information that is not pertaining to dd so it is difficult to find the contributions of dd. So I went through your post again and found something I didn't pay much attention to earlier i.e. oflag=sync, so did my own tests on my setup with FUSE so sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith Kumar Karampuri
Okay good. At least this validates my doubts. Handling O_SYNC in gluster NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write syscall has to be written to disk as part of the syscall, whereas in the case of NFS there is no concept of open. NFS performs the write through a handle saying it needs to be a synchronous write, so the write() syscall is performed first and then it performs fsync(). So a write on an fd with O_SYNC becomes write+fsync. I am suspecting that when multiple threads do this write+fsync() operation on the same file, multiple writes get batched together before being written to disk, which is why the disk throughput increases.
Does it answer your doubts?
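The two I/O patterns contrasted here (every write synchronous vs. plain writes followed by one sync) can also be seen from the client side by tracing dd in its two modes (a sketch; the target file is a throwaway on the FUSE mount):

# O_SYNC open: each 1 MB write() must be stable before the next one starts
strace -f -e trace=open,openat,write,fsync,fdatasync \
    dd if=/dev/zero of=/gluster-mount/synctest bs=1M count=4 oflag=sync

# plain writes plus a single fdatasync() at the end
strace -f -e trace=open,openat,write,fsync,fdatasync \
    dd if=/dev/zero of=/gluster-mount/synctest bs=1M count=4 conv=fdatasync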
On Wed, May 10, 2017 at 9:35 PM,
Without the oflag=sync and only a single test of each, the FUSE:
mseas-data2(dri_nascar)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd
if=/dev/zero count=4096
bs=1048576 of=zeros.txt
conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB)
copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith Kumar Karampuri
Could you let me know the speed without oflag=sync on both the mounts? No need to collect profiles.
On Wed, May 10, 2017 at 9:17 PM, Pat Haley
Here is what I see
gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
nfs.disable: on
On 05/10/2017 11:44 AM, Pranith Kumar Karampuri
Is this the volume info you have?

[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
nfs.disable: on

I copied this from an old thread from 2016. This is a distribute volume. Did you change any of the options in between?
Pat Haley
2017-06-26 14:10:54 UTC
Permalink
Hi All,

Decided to try another tests of gluster mounted via FUSE vs gluster
mounted via NFS, this time using the software we run in production (i.e.
our ocean model writing a netCDF file).

gluster mounted via NFS: the run took 2.3 hr

gluster mounted via FUSE: the run took 44.2 hr

The only problem with using gluster mounted via NFS is that it does not
respect the group write permissions which we need.

We have an exercise coming up in a couple of weeks. It seems to me
that in order to improve our write times before then, it would be good
to solve the group write permissions for gluster mounted via NFS now.
We can then revisit gluster mounted via FUSE afterwards.

What information would you need to help us force gluster mounted via NFS
to respect the group write permissions?

Thanks

Pat
Post by Pranith Kumar Karampuri
On Fri, Jun 23, 2017 at 9:10 AM, Pranith Kumar Karampuri
Hi,
Today we experimented with some of the FUSE options that we
found in the list.
gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB
This is a good coincidence, I am meeting with write-behind
maintainer(+Raghavendra G) today for the same doubt. I think we
will have something by EOD IST. I will update you.
Sorry, forgot to update you. It seems like there is a bug in
Write-behind and Facebook guys sent a patch
http://review.gluster.org/16079 to fix the same. But even with that I
am not seeing any improvement. Maybe I am doing something wrong. Will update you if I find anything more.
Changing the following option from its default value made the speed slower
gluster volume set test-volume performance.write-behind off (on by default)
Changing the following options initially appeared to give a 10% increase in speed, but this vanished in subsequent tests (we think the apparent increase may have been due to a lighter workload on the computer from other users)
gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4
Can anything be gleaned from these observations? Are there
other things we can try?
Thanks
Pat
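One small aside on the tuning experiments quoted above: each option can be put back to its default with 'gluster volume reset' rather than guessing the old value, e.g. (a sketch using the option names tried above):

gluster volume reset test-volume performance.write-behind
gluster volume reset test-volume performance.stat-prefetch
gluster volume reset test-volume client.event-threads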
Post by Pat Haley
Hi Ben,
Sorry this took so long, but we had a real-time forecasting
exercise last week and I could only get to this now.
* Much of the information on our back end system is
included at the top of
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
* The specific model of the hard disks is SeaGate
ENTERPRISE CAPACITY V.4 6TB (ST6000NM0024). The rated
speed is 6Gb/s.
* Note: there is one physical server that hosts both the
NFS and the GlusterFS areas
Latest tests
I have had time to run one of the dd tests you requested against the underlying XFS FS. The median rate was 170 MB/s. The dd results and iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and to the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both, what mix? Since we are using DD I assume you are working with large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Lets see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pranith Kumar Karampuri
2017-06-27 04:47:40 UTC
Permalink
Post by Pat Haley
Hi All,
Decided to try another tests of gluster mounted via FUSE vs gluster
mounted via NFS, this time using the software we run in production (i.e.
our ocean model writing a netCDF file).
gluster mounted via NFS the run took 2.3 hr
gluster mounted via FUSE: the run took 44.2 hr
The only problem with using gluster mounted via NFS is that it does not
respect the group write permissions which we need.
We have an exercise coming up in the a couple of weeks. It seems to me
that in order to improve our write times before then, it would be good to
solve the group write permissions for gluster mounted via NFS now. We can
then revisit gluster mounted via FUSE afterwards.
What information would you need to help us force gluster mounted via NFS
to respect the group write permissions?
+Niels, +Jiffin

I added 2 more guys who work on NFS to check why this problem happens in
your environment. Let's see what information they may need to find the
problem and solve this issue.
Post by Pat Haley
Thanks
Pat
--
Pranith
Soumya Koduri
2017-06-27 06:45:50 UTC
Permalink
Post by Pat Haley
The only problem with using gluster mounted via NFS is that it does not
respect the group write permissions which we need.
We have an exercise coming up in a couple of weeks. It seems to me
that in order to improve our write times before then, it would be good
to solve the group write permissions for gluster mounted via NFS now.
We can then revisit gluster mounted via FUSE afterwards.
What information would you need to help us force gluster mounted via NFS
to respect the group write permissions?
Is it the owning group or one of the auxiliary groups whose write
permissions are not considered? AFAIK, there are no special permission
checks done by the gNFS server compared to the gluster native client.

Could you please provide simple steps to reproduce the issue and collect
pkt trace and nfs/brick logs as well.

Thanks,
Soumya
Pat Haley
2017-06-27 16:29:48 UTC
Permalink
Hi Soumya,

One example, we have a common working directory dri_fleat in the gluster
volume

drwxrwsr-x 22 root dri_fleat 4.0K May 1 15:14 dri_fleat

my user (phaley) does not own that directory but is a member of the
group dri_fleat and should have write permissions. When I go to the
nfs-mounted version and try to use the touch command I get the following

ibfdr-compute-0-4(dri_fleat)% touch dum
touch: cannot touch `dum': Permission denied

One of the sub-directories under dri_fleat is "test" which phaley owns

drwxrwsr-x 2 phaley dri_fleat 4.0K May 1 15:16 test

Under this directory (mounted via nfs) user phaley can write

ibfdr-compute-0-4(test)% touch dum
ibfdr-compute-0-4(test)%
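
For reference, a condensed sketch of the reproduction above (run as user phaley on the client; /gdata-nfs stands in for the client-side NFS mount point, which is not named in the thread):

id phaley                              # confirm dri_fleat appears in the group list
ls -ld /gdata-nfs/dri_fleat            # drwxrwsr-x root dri_fleat
touch /gdata-nfs/dri_fleat/dum         # fails over kNFS with "Permission denied"
touch /gdata-nfs/dri_fleat/test/dum    # succeeds, since phaley owns test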

I have put the packet captures in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/

capture_nfsfail.pcap has the results from the failed touch experiment
capture_nfssucceed.pcap has the results from the successful touch
experiment

The command I used for these was

tcpdump -i ib0 -nnSs 0 host 172.16.1.119 -w /root/capture_nfstest.pcap

The brick log files are also in the above link. If I read them
correctly, they both have odd timestamps. Specifically, I see entries from
around 2017-06-27 14:02:37.404865 even though the system time was
2017-06-27 12:00:00.

One final item: another reply to my post had a link about possible
problems that could arise from users belonging to too many groups. We
have seen the above problem even with a user belonging to only 4 groups.

Let me know what additional information I can provide.

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-06-30 14:26:46 UTC
Permalink
Hi,

I was wondering if there were any additional tests we could perform to
help debug the group write-permissions issue?

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-07-03 11:58:26 UTC
Permalink
Post by Pat Haley
Hi,
I was wondering if there were any additional test we could perform to
help debug the group write-permissions issue?
Sorry for the delay. Please find response inline --
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Soumya,
One example, we have a common working directory dri_fleat in the
gluster volume
drwxrwsr-x 22 root dri_fleat 4.0K May 1 15:14 dri_fleat
my user (phaley) does not own that directory but is a member of the
group dri_fleat and should have write permissions. When I go to the
nfs-mounted version and try to use the touch command I get the following
ibfdr-compute-0-4(dri_fleat)% touch dum
touch: cannot touch `dum': Permission denied
One of the sub-directories under dri_fleat is "test" which phaley owns
drwxrwsr-x 2 phaley dri_fleat 4.0K May 1 15:16 test
Under this directory (mounted via nfs) user phaley can write
ibfdr-compute-0-4(test)% touch dum
ibfdr-compute-0-4(test)%
I have put the packet captures in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/
capture_nfsfail.pcap has the results from the failed touch experiment
capture_nfssucceed.pcap has the results from the successful touch
experiment
The command I used for these was
tcpdump -i ib0 -nnSs 0 host 172.16.1.119 -w /root/capture_nfstest.pcap
I hope these pkts were captured on the node where the NFS server is running.
Could you please use '-i any', as I do not see glusterfs traffic in the
tcpdump.

Also, it looks like NFS v4 is used between the client & the NFS server. Are you
using kernel-NFS here (i.e., kernel-NFS exporting a fuse-mounted gluster
volume)?
If that is the case, please capture the fuse-mnt logs as well. This error may
well be coming from kernel-NFS itself before the request is sent to
the fuse-mnt process.

FWIW, we have the below option -

Option: server.manage-gids
Default Value: off
Description: Resolve groups on the server-side.

I haven't looked into exactly what this option does, but it may be worth
testing with this option on.
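
A minimal sketch of that test, assuming the volume name data-volume from the gluster v info output quoted later in the thread (run on one of the gluster servers):

gluster volume set data-volume server.manage-gids on
gluster volume info data-volume    # the option should now be listed as reconfigured

and then retry the failing touch from the NFS client.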

Thanks,
Soumya
Pat Haley
2017-07-03 15:31:35 UTC
Permalink
Hi Soumya,

When I originally did the tests I ran tcpdump on the client.

I have rerun the tests, doing tcpdump on the server

tcpdump -i any -nnSs 0 host 172.16.1.121 -w /root/capture_nfsfail.pcap

The results are in the same place

http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/

capture_nfsfail.pcap has the results from the failed touch experiment
capture_nfssucceed.pcap has the results from the successful touch
experiment

The brick log files are there too.

I believe we are using kernel-NFS exporting a fuse mounted gluster
volume. I am having Steve confirm this. I tried to find the fuse-mnt
logs but failed. Where should I look for them?

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-07-04 09:01:48 UTC
Permalink
Post by Pat Haley
Hi Soumya,
When I originally did the tests I ran tcpdump on the client.
I have rerun the tests, doing tcpdump on the server
tcpdump -i any -nnSs 0 host 172.16.1.121 -w /root/capture_nfsfail.pcap
The results are in the same place
http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/
capture_nfsfail.pcap has the results from the failed touch experiment
capture_nfssucceed.pcap has the results from the successful touch
experiment
The brick log files are there too.
Thanks for sharing. It looks like the error is not generated on the
gluster-server side. The permission-denied error was caused either by
kNFS or by the fuse-mnt process, or possibly by the combination of the two.

To check the fuse-mnt logs, please look at
/var/log/glusterfs/<fuse_mnt_directory>.log

For example: if you have fuse-mounted the gluster volume at /mnt/fuse-mnt
and exported it via kNFS, the log for that fuse mount will be at
/var/log/glusterfs/mnt-fuse-mnt.log
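
A quick way to locate that log on the server (a sketch; the log file name is the fuse mount point path with '/' replaced by '-'):

mount -t fuse.glusterfs            # lists the gluster fuse mounts and their mount points
ls -l /var/log/glusterfs/*.log     # e.g. a volume fuse-mounted at /gdata logs to gdata.log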


Also, why not switch to either the gluster-NFS native server or NFS-Ganesha
instead of kNFS? They are the recommended NFS servers to use with
gluster.

Thanks,
Soumya
Pat Haley
2017-07-05 15:36:17 UTC
Permalink
Hi Soumya,

(1) In http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/
I've placed the following 2 log files

etc-glusterfs-glusterd.vol.log
gdata.log

The first has repeated messages about nfs disconnects. The second had
the <fuse_mnt_directory>.log name (but not much information).

(2) About the gluster-NFS native server: do you know where we can find
documentation on how to use/install it? We haven't had success in our
searches.

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Pat Haley
2017-07-07 00:46:24 UTC
Permalink
Hi All,

A follow-up question. I've been looking at various pages on nfs-ganesha
& gluster. Is there a version of nfs-ganesha that is recommended for
use with

glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)

Thanks

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-07-07 17:31:42 UTC
Permalink
Hi,
Post by Pat Haley
Hi All,
A follow-up question. I've been looking at various pages on nfs-ganesha
& gluster. Is there a version of nfs-ganesha that is recommended for
use with
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
For glusterfs 3.7, nfs-ganesha-2.3-* version can be used.

I see the packages built in the CentOS 7 storage SIG [1] but not for CentOS 6 [2].
Requesting Niels to comment.
Post by Pat Haley
Thanks
Pat
Post by Pat Haley
Hi Soumya,
(1) In http://mseas.mit.edu/download/phaley/GlusterUsers/TestNFSmount/
I've placed the following 2 log files
etc-glusterfs-glusterd.vol.log
gdata.log
The first has repeated messages about nfs disconnects. The second had
the <fuse_mnt_directory>.log name (but not much information).
Hmm, yeah.. weird.. there are not many logs in the fuse mnt log file.
Post by Pat Haley
Post by Pat Haley
(2) About the gluster-NFS native server: do you know where we can
find documentation on how to use/install it? We haven't had success
in our searches.
Till glusterfs-3.7, gluster-NFS (gNFS) gets enabled by default. The only
requirement is that kernel-NFS has to be disabled for gluster-NFS to
come up. Please disable kernel-NFS server and restart glusterd to start
gNFS. In case of any issues with starting gNFS server, please look at
/var/log/glusterfs/nfs.log.

Thanks,
Soumya


[1] https://buildlogs.centos.org/centos/7/storage/x86_64/gluster-3.7/
[2] https://buildlogs.centos.org/centos/6/storage/x86_64/gluster-3.7/
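
A sketch of that switch on a CentOS 6 server, assuming the volume name data-volume and the options shown later in the thread (nfs.disable: on, nfs.export-volumes: off), and assuming nothing else depends on the kernel-NFS exports:

service nfs stop && chkconfig nfs off             # stop kernel-NFS so gNFS can register with rpcbind
gluster volume set data-volume nfs.disable off
gluster volume set data-volume nfs.export-volumes on
service glusterd restart
showmount -e localhost                            # the volume should now be exported by gNFS
less /var/log/glusterfs/nfs.log                   # check here if gNFS does not come up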
Pat Haley
2017-07-14 01:10:17 UTC
Permalink
Hi Soumya,

I just noticed some of the notes at the bottom. In particular

* Till glusterfs-3.7, gluster-NFS (gNFS) gets enabled by default. The
only requirement is that kernel-NFS has to be disabled for
gluster-NFS to come up. Please disable kernel-NFS server and restart
glusterd to start gNFS. In case of any issues with starting gNFS
server, please look at /var/log/glusterfs/nfs.log.

If we disable the kernel-NFS on our server and restart glusterd to start
gNFS will that affect the NFS file system also being served by that
server (i.e. the single server serves both a glusterFS area and an NFS
area)? Would we also have to disable the kernel-NFS for NFS-ganesha?

My second question concerns NFS-ganesha (v 2.3.x) for CentOS 6.8 and
gluster 3.7.11. I think I see a couple of possibilities

1. I see one possible rpm for version 2.3.3 in
https://mirror.chpc.utah.edu/pub/vault.centos.org/centos/6.8/storage/Source/gluster-3.8/
The other rpms seem to be for gluster 3.8 packages, so I'm
wondering if there is a concern about conflicts
2. In one of the links you sent
(https://buildlogs.centos.org/centos/6/storage/x86_64/gluster-3.7/)
I see an rpm for glusterfs-ganesha-3.7.11 . Is this a specific
gluster package for compatibility with ganesha or a ganesha package
for gluster?

Does either possibility seem more likely to be what I need than the other?

Pat
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Soumya Koduri
2017-07-14 05:04:13 UTC
Permalink
Post by Pat Haley
Hi Soumya,
I just noticed some of the notes at the bottom. In particular
* Till glusterfs-3.7, gluster-NFS (gNFS) gets enabled by default. The
only requirement is that kernel-NFS has to be disabled for
gluster-NFS to come up. Please disable kernel-NFS server and restart
glusterd to start gNFS. In case of any issues with starting gNFS
server, please look at /var/log/glusterfs/nfs.log.
If we disable the kernel-NFS on our server and restart glusterd to start
gNFS will that affect the NFS file system also being served by that
server (i.e. the single server serves both a glusterFS area and an NFS
area)?
That's right. When you restart glusterd, it tries to spawn (provided
the nfs.disable option is set to off for any volume) a new glusterfs client
process which acts as the NFS server as well.

Would we also have to disable the kernel-NFS for NFS-ganesha?

yes.
Post by Pat Haley
My second question concerns NFS-ganesha (v 2.3.x) for CentOS 6.8 and
gluster 3.7.11. I think I see a couple of possibilities
1. I see one possible rpm for version 2.3.3 in
https://mirror.chpc.utah.edu/pub/vault.centos.org/centos/6.8/storage/Source/gluster-3.8/
The other rpm's seem to be for gluster 3.8 packages, so I'm
wondering if there is a concern for conflict
AFAIK, nfs-ganesha-2.3.3 should work with both 3.8 & 3.7 gluster.
Post by Pat Haley
2. In one of the links you sent
(https://buildlogs.centos.org/centos/6/storage/x86_64/gluster-3.7/)
I see an rpm for glusterfs-ganesha-3.7.11 . Is this a specific
gluster package for compatibility with ganesha or a ganesha package
for gluster?
This is to be compatible with gluster-3.7* package.
Post by Pat Haley
Does either possibility seem more likely to be what I need than the other?
The current stable/maintained/tested combination is nfs-ganesha-2.4/2.5 +
glusterfs-3.8/3.10. However, in case you cannot upgrade, you can still
use nfs-ganesha-2.3* with glusterfs-3.8/3.7.

Hope it is clear.

Thanks,
Soumya
Niels de Vos
2017-06-27 08:13:17 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Pat Haley
Hi All,
Decided to try another tests of gluster mounted via FUSE vs gluster
mounted via NFS, this time using the software we run in production (i.e.
our ocean model writing a netCDF file).
gluster mounted via NFS the run took 2.3 hr
gluster mounted via FUSE: the run took 44.2 hr
The only problem with using gluster mounted via NFS is that it does not
respect the group write permissions which we need.
We have an exercise coming up in a couple of weeks. It seems to me
that in order to improve our write times before then, it would be good to
solve the group write permissions for gluster mounted via NFS now. We can
then revisit gluster mounted via FUSE afterwards.
What information would you need to help us force gluster mounted via NFS
to respect the group write permissions?
+Niels, +Jiffin
I added 2 more guys who work on NFS to check why this problem happens in
your environment. Let's see what information they may need to find the
problem and solve this issue.
Hi Pat,

Depending on the number of groups that a user is part of, you may need
to change some volume options. A complete description of the limitations
on the number of groups can be found here:

https://github.com/gluster/glusterdocs/blob/master/Administrator%20Guide/Handling-of-users-with-many-groups.md

HTH,
Niels
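
The options that document discusses can be enabled per volume; a sketch, assuming the volume name data-volume (verify the exact option names against gluster volume set help on 3.7):

gluster volume set data-volume server.manage-gids on     # bricks resolve the full group list server-side
gluster volume set data-volume nfs.server-aux-gids on    # same resolution for the gNFS server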
Post by Pranith Kumar Karampuri
Post by Pat Haley
Thanks
Pat
On Fri, Jun 23, 2017 at 9:10 AM, Pranith Kumar Karampuri <
Post by Pranith Kumar Karampuri
Post by Pat Haley
Hi,
Today we experimented with some of the FUSE options that we found in the
list.
gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB
This is a good coincidence, I am meeting with write-behind
maintainer(+Raghavendra G) today for the same doubt. I think we will have
something by EOD IST. I will update you.
Sorry, forgot to update you. It seems like there is a bug in Write-behind
and Facebook guys sent a patch http://review.gluster.org/16079 to fix the
same. But even with that I am not seeing any improvement. May be I am doing
something wrong. Will update you if I find anything more.
Post by Pranith Kumar Karampuri
Changing the following option from its default value made the speed slower
Post by Pat Haley
gluster volume set test-volume performance.write-behind off (on by default)
Changing the following options initially appeared to give a 10% increase
in speed, but this vanished in subsequent tests (we think the apparent
increase may have been due to a lighter workload on the computer from other
users)
gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4
Can anything be gleaned from these observations? Are there other things
we can try?
Thanks
Pat
Hi Ben,
Sorry this took so long, but we had a real-time forecasting exercise
last week and I could only get to this now.
- Much of the information on our back end system is included at the top of
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
- The specific model of the hard disks is SeaGate ENTERPRISE
CAPACITY V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
- Note: there is one physical server that hosts both the NFS and the
GlusterFS areas
Latest tests
I have had time to run one of the dd tests you requested against the
underlying XFS FS. The median rate was 170 MB/s. The dd results
and iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and for the NFS area later.
Thanks
Pat
throughput = slowest of disks / NIC (whichever is slower) * .6-.7
1200 * .6 = 720
-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?
-Second can you refresh me on your workload? Are you doing reads / writes or both? If both, what mix? Since we are using DD I assume you are working with large file sequential I/O, is this correct?
# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
# iostat -c -m -x 1 > iostat-$(hostname).txt
Let's see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.
-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples to apples comparison here.
-b
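Putting Ben's sequence together, a hedged sketch of one write/read round (the mount point and file name are placeholders; iostat runs in the background for the duration of the test):

iostat -c -m -x 1 > iostat-$(hostname).txt &
IOSTAT_PID=$!
dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
echo 3 > /proc/sys/vm/drop_caches
dd if=/xfs-mount/file of=/dev/null bs=1024k count=10000
kill $IOSTAT_PID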
----- Original Message -----
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
What is the output of gluster v info? That will tell us more about your
config.
-b
----- Original Message -----
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Ben,
I guess I'm confused about what you mean by replication. If I look at
the underlying bricks I only ever have a single copy of any file. It
either resides on one brick or the other (directories exist on both
bricks but not files). We are not using gluster for redundancy (or at
least that wasn't our intent). Is that what you meant by replication
or is it something else?
Thanks
Pat
----- Original Message -----
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Guys,
I was wondering what our next steps should be to solve the slow write times.
Recently I was debugging a large code and writing a lot of output at
every time step. When I tried writing to our gluster disks, it was
taking over a day to do a single time step whereas if I had the same
program (same hardware, network) write to our nfs disk the time per
time-step was about 45 minutes. What we are shooting for here would be
to have similar times with either gluster or nfs.
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
You averaged ~600 MB / sec (expected for replica 2 with 10G, {~1200 MB / sec} / #replicas{2} = 600). Gluster does client side replication so with replica 2 you will only ever see 1/2 the speed of your slowest part of the stack (NW, disk, RAM, CPU). This is usually NW or disk and 600 is normally a best case. Now in your output I do see instances where you deviated from this; a few possible explanations:
1. You are not using conv=fdatasync and writes are actually going to page cache and then being flushed to disk. During the fsync the memory is not yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNs (like in a SAN) and when write times are slow the RAID group is busy servicing other LUNs.
3. Gluster bug / config issue / some other unknown unknown.
The two issues you report:
1. NFS does in 45 minutes what gluster can do in 24 hours.
2. Sometimes your throughput drops dramatically.
WRT #1 - have a look at my estimates above. My formula for guestimating gluster perf is: throughput = NIC throughput or storage (whatever is slower) / # replicas * overhead (figure .7 or .8). Also the larger the record size the better for glusterfs mounts, I normally like to be at 1M:
# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
WRT #2 - Again, I question your testing and your storage config. Try using conv=fdatasync for your DDs, use a larger record size, and make sure that your back end storage is not causing your slowdowns. Also remember that with replica 2 you will take ~50% hit on writes because the client uses 50% of its bandwidth to write to one replica and 50% to the other.
-b
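To make that arithmetic explicit with this thread's numbers (an illustration only): a 10G NIC is roughly 1200 MB/s, so a plain distribute volume (replica count 1) estimates to about 1200 * .6-.7 = 720-840 MB/s best case, while a replica 2 volume halves that to ~600 MB/s before the same overhead factor is applied.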
Thanks
Pat
Are you sure using conv=sync is what you want? I normally use
conv=fdatasync, I'll look up the difference between the two and see if it
affects your test.
-b
----- Original Message -----
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: [Gluster-users] Slow write times to gluster disk
Hi Pranith,
dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
There were 2 instances where dd reported 22 seconds. The output from the
dd tests are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
Pat
Pat,
What is the command you used? As per the following output, it
seems like at least one write operation took 16 seconds. Which is
really bad.
96.39 1165.10 us 89.00 us*16487014.00 us*
393212
WRITE
Hi Pranith,
I ran the same 'dd' test both in the gluster test volume and in
the .glusterfs directory of each brick. The median results
(12
dd
trials in each test) are similar to before
* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s
The profile for the gluster test-volume is in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt> <http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>
Thanks
Pat
Let's start with the same 'dd' test we were testing with to see,
what the numbers are. Please provide profile numbers for the
same. From there on we will start tuning the volume to see what
we can do.
Hi Pranith,
Thanks for the tip. We now have the gluster volume mounted
under /home. What tests do you recommend we run?
Thanks
Pat
On Tue, May 16, 2017 at 9:20 PM, Pat Haley
Hi Pranith,
Sorry for the delay. I never received your reply (but I did receive Ben Turner's follow-up to your reply). So we tried to create a gluster volume under /home using different variations of
gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2 transport tcp
However we keep getting errors of the form
Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
Any thoughts on what we're doing wrong?
You should give transport tcp at the beginning I think. Anyways, transport tcp is the default, so there is no need to specify it; just remove those two words from the CLI.
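For reference, a minimal sketch of the corrected command with those two words removed (hostname and brick paths as in the attempt above):

gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
gluster volume start test-volume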
Also do you have a list of the tests we should be running once we get this volume created? Given the time-zone difference it might help if we can run a small battery of tests and post the results rather than test-post-new-test-post... .
This is the first time I am doing performance analysis with users as far as I remember. In our team there are separate engineers who do these tests. Ben who replied earlier is one such engineer.
Ben,
Have any suggestions?
Thanks
Pat
On 05/11/2017 12:06 PM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 9:32 PM, Pat Haley
Hi Pranith,
The /home partition is mounted as ext4
/home ext4 defaults,usrquota,grpquota 1 2
The brick partitions are mounted as xfs
/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0
Will this cause a problem with creating a volume under /home?
I don't think the bottleneck is disk. You can do the same tests you did on your new volume to confirm?
Pat
On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 8:57 PM, Pat Haley
Hi Pranith,
Unfortunately, we don't have similar hardware for a small scale test. All we have is our production hardware.
You said something about the /home partition which has fewer disks; we can create a plain distribute volume inside one of those directories. After we are done, we can remove the setup. What do you say?
Pat
On 05/11/2017 07:05 AM, Pranith Kumar Karampuri
On Thu, May 11, 2017 at 2:48 AM, Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The results without oflag=sync were 1.6 Gb/s (faster than gluster but not as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer disks).
Okay, then 1.6Gb/s is what we need to target for, considering your volume is just distribute. Is there any way you can do tests on similar hardware but at a small scale? Just so we can run the workload to learn more about the bottlenecks in the system? We can probably try to get the speed to 1.2Gb/s on your /home partition you were telling me yesterday. Let me know if that is something you are okay to do.
Pat
On 05/10/2017 01:27 PM, Pranith Kumar Karampuri
On Wed, May 10, 2017 at 10:15 PM, Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run your answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also add the dd tests writing to the /home area (no gluster, still on the same machine)
* dd test without oflag=sync (rough average of multiple tests)
  o gluster w/ fuse mount: 570 Mb/s
  o gluster w/ nfs mount: 390 Mb/s
  o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
  o gluster w/ fuse mount: 5 Mb/s
  o gluster w/ nfs mount: 200 Mb/s
  o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each brick of the gluster area is a RAID-6 of 32 disks, I would naively expect the writes to the gluster area to be roughly 8x faster than to the non-gluster.
I think a better test is to try and write to a file using nfs without any gluster to a location that is not inside the brick but some other location that is on the same disk(s). If you are mounting the partition as the brick, then we can write to a file inside the .glusterfs directory, something like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't tell if fuse vs nfs is part of the problem.
I got interested in the post because I read that fuse speed is lesser than nfs speed which is counter-intuitive to my understanding. So wanted clarifications. Now that I got my clarifications where fuse outperformed nfs without sync, we can resume testing as described above and try to find what it is. Based on your email-id I am guessing you are from Boston and I am from Bangalore so if you are okay with doing this debugging for multiple days because of timezones, I will be happy to help. Please be a bit patient with me, I am under a release crunch but I am very curious about the problem you posted.
Was there anything useful in the profiles?
Unfortunately the profiles didn't help me much. I think we are collecting the profiles from an active volume, so it has a lot of information that is not pertaining to dd, so it is difficult to find the contributions of dd. So I went through your post again and found something I didn't pay much attention to earlier, i.e. oflag=sync, so I did my own tests on my setup with FUSE and sent that reply.
Pat
On 05/10/2017 12:15 PM, Pranith Kumar Karampuri
Okay good. At least this validates my doubts. Handling O_SYNC in gluster NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write syscall has to be written to disk as part of the syscall, whereas in the case of NFS there is no concept of open. NFS performs the write through a handle saying it needs to be a synchronous write, so the write() syscall is performed first and then it performs fsync(); so a write on an fd with O_SYNC becomes write+fsync. I am suspecting that when multiple threads do this write+fsync() operation on the same file, multiple writes are batched together to be written to disk, so the throughput on the disk increases is my guess.
Does it answer your doubts?
On Wed, May 10, 2017 at 9:35 PM, Pat Haley
Without the oflag=sync and only a single test of each, the
FUSE
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
On 05/10/2017 11:53 AM, Pranith Kumar Karampuri
Could you let me know the speed without oflag=sync on both the mounts? No need to collect profiles.
On Wed, May 10, 2017 at 9:17 PM, Pat Haley
Here is what I see:
gluster volume info

Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

On 05/10/2017 11:44 AM, Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from an old thread from 2016. This is a distribute volume. Did you change any of the options in between?
--
Pranith
Joe Julian
2017-05-16 16:03:17 UTC
Permalink
Post by Pat Haley
Hi Pranith,
Since we are mounting the partitions as the bricks, I tried the dd
test writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>. The results
without oflag=sync were 1.6 Gb/s (faster than gluster but not as fast
as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer
disks).
Pat
Is that true for every disk? If you're choosing the same filename every
time for your dd test, you're likely only doing that test against one
disk. If that disk is slow, you would get the same results every time
despite other disks performing normally.
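One hedged way to check this (mount point and brick paths as used earlier in this thread; the ddtest_* names are placeholders): repeat the dd test with several different file names so DHT hashes them to different bricks, then look at which brick each file actually landed on.

for i in 1 2 3 4; do dd if=/dev/zero of=/gdata/ddtest_$i bs=1M count=1024 conv=fdatasync; done
ls -l /mnt/brick1/ddtest_* /mnt/brick2/ddtest_* 2>/dev/null
rm -f /gdata/ddtest_*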
Post by Pat Haley
Post by Pat Haley
Hi Pranith,
Not entirely sure (this isn't my area of expertise). I'll run
your answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we
also add the dd tests writing to the /home area (no gluster,
still on the same machine)
* dd test without oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount : 570 Mb/s
o gluster w/ nfs mount: 390 Mb/s
o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
o gluster w/ fuse mount: 5 Mb/s
o gluster w/ nfs mount: 200 Mb/s
o nfs (no gluster): 20 Mb/s
Given that the non-gluster area is a RAID-6 of 4 disks while each
brick of the gluster area is a RAID-6 of 32 disks, I would
naively expect the writes to the gluster area to be roughly 8x
faster than to the non-gluster.
I think a better test is to try and write to a file using nfs without
any gluster to a location that is not inside the brick but some other
location that is on same disk(s). If you are mounting the partition
as the brick, then we can write to a file inside .glusterfs
directory, something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
I still think we have a speed issue, I can't tell if fuse vs nfs
is part of the problem.
I got interested in the post because I read that fuse speed is lesser
than nfs speed which is counter-intuitive to my understanding. So
wanted clarifications. Now that I got my clarifications where fuse
outperformed nfs without sync, we can resume testing as described
above and try to find what it is. Based on your email-id I am
guessing you are from Boston and I am from Bangalore so if you are
okay with doing this debugging for multiple days because of
timezones, I will be happy to help. Please be a bit patient with me,
I am under a release crunch but I am very curious with the problem
you posted.
Was there anything useful in the profiles?
Unfortunately profiles didn't help me much, I think we are collecting
the profiles from an active volume, so it has a lot of information
that is not pertaining to dd so it is difficult to find the
contributions of dd. So I went through your post again and found
something I didn't pay much attention to earlier i.e. oflag=sync, so
did my own tests on my setup with FUSE so sent that reply.
Pat
Post by Pranith Kumar Karampuri
Okay good. At least this validates my doubts. Handling O_SYNC in
gluster NFS and fuse is a bit different.
When an application opens a file with O_SYNC on a fuse mount, each write syscall has to be written to disk as part of the syscall, whereas in the case of NFS there is no concept of open. NFS performs the write through a handle saying it needs to be a synchronous write, so the write() syscall is performed first and then it performs fsync(); so a write on an fd with O_SYNC becomes write+fsync. I am suspecting that when multiple threads do this write+fsync() operation on the same file, multiple writes are batched together to be written to disk, so the throughput on the disk increases is my guess.
Does it answer your doubts?
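To see that difference from the command line, a hedged sketch with dd (file names are placeholders): oflag=sync makes every write synchronous, much like an O_SYNC open, while conv=fdatasync writes through the cache and flushes once at the end, which is usually the fairer throughput comparison.

dd if=/dev/zero of=/gdata/sync_test.tmp bs=1M count=1024 oflag=sync
dd if=/dev/zero of=/gdata/fdatasync_test.tmp bs=1M count=1024 conv=fdatasync
rm -f /gdata/sync_test.tmp /gdata/fdatasync_test.tmp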
Without the oflag=sync and only a single test of each, the
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096
bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
NFS
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576
of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
Post by Pranith Kumar Karampuri
Could you let me know the speed without oflag=sync on both
the mounts? No need to collect profiles.
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
Post by Pranith Kumar Karampuri
Is this the volume info you have?
[root at mseas-data2 ~]# gluster volume info
Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off
I copied this from an old thread from 2016. This is a distribute volume. Did you change any of the options in between?
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
--
Pranith
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pat Haley
2017-05-03 04:50:01 UTC
Permalink
Hi Pranith & Ravi,

Sorry for the delay. I have the profile info for the past couple of
days just below. Is this of any help to you or is there additional
information I can request?

Brick: mseas-data2:/mnt/brick2
------------------------------
Cumulative Stats:
Block Size: 1b+ 2b+ 4b+
No. of Reads: 6 38 1144
No. of Writes: 108032195 8352125 141319922

Block Size: 8b+ 16b+ 32b+
No. of Reads: 689 1256 2756
No. of Writes: 13946933 20694915 57845473

Block Size: 64b+ 128b+ 256b+
No. of Reads: 5522 56492 149462
No. of Writes: 714398165 11923303 2537176

Block Size: 512b+ 1024b+ 2048b+
No. of Reads: 64285 192872 200488
No. of Writes: 5975842 217173849 94536339

Block Size: 4096b+ 8192b+ 16384b+
No. of Reads: 300021 764297 1613672
No. of Writes: 112481858 53164978 330177486

Block Size: 32768b+ 65536b+ 131072b+
No. of Reads: 5101884 14470916 4958306977
No. of Writes: 35098110 19969017 2243344759

Block Size: 262144b+
No. of Reads: 0
No. of Writes: 547
%-latency Avg-latency Min-Latency Max-Latency No. of
calls Fop
--------- ----------- ----------- -----------
------------ ----
0.00 0.00 us 0.00 us 0.00 us 4052087 FORGET
0.00 0.00 us 0.00 us 0.00 us 6381234 RELEASE
0.00 0.00 us 0.00 us 0.00 us 28716633 RELEASEDIR
0.00 92.81 us 48.00 us 130.00 us 53
READLINK
0.00 201.22 us 112.00 us 457.00 us
188 RMDIR
0.00 169.36 us 53.00 us 20417.00 us 347
SETXATTR
0.00 20497.89 us 241.00 us 57505.00 us 45
SYMLINK
0.00 116.97 us 42.00 us 39168.00 us 9172
SETATTR
0.00 380.06 us 76.00 us 198427.00 us
3133 LINK
0.00 149.60 us 14.00 us 601941.00 us 14426
INODELK
0.00 387.81 us 69.00 us 161114.00 us
6617 RENAME
0.01 96.47 us 14.00 us 1224734.00 us
63599 STATFS
0.01 25041.48 us 299.00 us 93211.00 us
348 MKDIR
0.01 380.41 us 31.00 us 561724.00 us
31452 OPEN
0.02 1346.42 us 64.00 us 226741.00 us
18306 UNLINK
0.02 2123.19 us 42.00 us 802398.00 us 12370
FTRUNCATE
0.04 12161.88 us 175.00 us 158072.00 us
3244 MKNOD
0.07 132801.87 us 39.00 us 3144448.00 us
532 FSYNC
0.13 89.98 us 4.00 us 5550246.00 us 1492793 FLUSH
0.45 65.89 us 6.00 us 3608035.00 us 7194229 FSTAT
0.57 14538.33 us 162.00 us 4577282.00 us
41466 CREATE
0.70 3183.52 us 16.00 us 4358324.00 us 231728
OPENDIR
1.67 7559.32 us 8.00 us 4193443.00 us
234012 STAT
2.26 119.27 us 11.00 us 4491219.00 us 20093638 WRITE
2.51 207.00 us 10.00 us 4993074.00 us 12884466 READ
4.17 246.12 us 13.00 us 8857354.00 us 17952607 GETXATTR
23.72 48775.51 us 14.00 us 5022445.00 us 515770
READDIRP
63.65 1238.53 us 25.00 us 4483760.00 us 54507520 LOOKUP

Duration: 9810315 seconds
Data Read: 651660783328883 bytes
Data Written: 305412177327433 bytes

Interval 0 Stats:
Block Size: 1b+ 2b+ 4b+
No. of Reads: 6 38 1144
No. of Writes: 108032195 8352125 141319922

Block Size: 8b+ 16b+ 32b+
No. of Reads: 689 1256 2756
No. of Writes: 13946933 20694915 57845473

Block Size: 64b+ 128b+ 256b+
No. of Reads: 5522 56492 149462
No. of Writes: 714398165 11923303 2537176

Block Size: 512b+ 1024b+ 2048b+
No. of Reads: 64285 192872 200488
No. of Writes: 5975842 217173849 94536339

Block Size: 4096b+ 8192b+ 16384b+
No. of Reads: 300021 764297 1613672
No. of Writes: 112481858 53164978 330177486

Block Size: 32768b+ 65536b+ 131072b+
No. of Reads: 5101884 14470916 4958306977
No. of Writes: 35098110 19969017 2243344759

Block Size: 262144b+
No. of Reads: 0
No. of Writes: 547
%-latency Avg-latency Min-Latency Max-Latency No. of
calls Fop
--------- ----------- ----------- -----------
------------ ----
0.00 0.00 us 0.00 us 0.00 us 4052087 FORGET
0.00 0.00 us 0.00 us 0.00 us 6381233 RELEASE
0.00 0.00 us 0.00 us 0.00 us 28716630 RELEASEDIR
0.00 92.81 us 48.00 us 130.00 us 53
READLINK
0.00 201.22 us 112.00 us 457.00 us
188 RMDIR
0.00 169.36 us 53.00 us 20417.00 us 347
SETXATTR
0.00 20497.89 us 241.00 us 57505.00 us 45
SYMLINK
0.00 116.97 us 42.00 us 39168.00 us 9172
SETATTR
0.00 380.06 us 76.00 us 198427.00 us
3133 LINK
0.00 149.60 us 14.00 us 601941.00 us 14426
INODELK
0.00 387.81 us 69.00 us 161114.00 us
6617 RENAME
0.01 96.47 us 14.00 us 1224734.00 us
63599 STATFS
0.01 25041.48 us 299.00 us 93211.00 us
348 MKDIR
0.01 380.41 us 31.00 us 561724.00 us
31452 OPEN
0.02 1346.42 us 64.00 us 226741.00 us
18306 UNLINK
0.02 2123.19 us 42.00 us 802398.00 us 12370
FTRUNCATE
0.04 12161.88 us 175.00 us 158072.00 us
3244 MKNOD
0.07 132801.87 us 39.00 us 3144448.00 us
532 FSYNC
0.13 89.98 us 4.00 us 5550246.00 us 1492793 FLUSH
0.45 65.89 us 6.00 us 3608035.00 us 7194229 FSTAT
0.57 14538.33 us 162.00 us 4577282.00 us
41466 CREATE
0.70 3183.52 us 16.00 us 4358324.00 us 231728
OPENDIR
1.67 7559.32 us 8.00 us 4193443.00 us
234012 STAT
2.26 119.27 us 11.00 us 4491219.00 us 20093638 WRITE
2.51 207.00 us 10.00 us 4993074.00 us 12884466 READ
4.17 246.12 us 13.00 us 8857354.00 us 17952607 GETXATTR
23.72 48775.51 us 14.00 us 5022445.00 us 515770
READDIRP
63.65 1238.53 us 25.00 us 4483760.00 us 54507520 LOOKUP

Duration: 9810315 seconds
Data Read: 651660783328883 bytes
Data Written: 305412177327433 bytes

Brick: mseas-data2:/mnt/brick1
------------------------------
Cumulative Stats:
Block Size: 1b+ 2b+ 4b+
No. of Reads: 4 38 1482
No. of Writes: 643631512 59055444 235532859

Block Size: 8b+ 16b+ 32b+
No. of Reads: 1171 2138 4748
No. of Writes: 31816870 23602175 50161322

Block Size: 64b+ 128b+ 256b+
No. of Reads: 9461 65360 165954
No. of Writes: 711114605 11760241 4078907

Block Size: 512b+ 1024b+ 2048b+
No. of Reads: 94563 226053 258803
No. of Writes: 6366990 211643393 95831137

Block Size: 4096b+ 8192b+ 16384b+
No. of Reads: 383871 1032345 2244921
No. of Writes: 155833532 57850303 339892660

Block Size: 32768b+ 65536b+ 131072b+
No. of Reads: 7588068 22368398 5387488199
No. of Writes: 38588368 25195605 2463004132

Block Size: 262144b+
No. of Reads: 0
No. of Writes: 489
%-latency Avg-latency Min-Latency Max-Latency No. of
calls Fop
--------- ----------- ----------- -----------
------------ ----
0.00 0.00 us 0.00 us 0.00 us 4060396 FORGET
0.00 0.00 us 0.00 us 0.00 us 6244016 RELEASE
0.00 0.00 us 0.00 us 0.00 us 28716852 RELEASEDIR
0.00 96.42 us 61.00 us 148.00 us 40
READLINK
0.00 208.36 us 114.00 us 322.00 us
188 RMDIR
0.00 2231.61 us 57.00 us 716342.00 us 347
SETXATTR
0.00 20821.92 us 758.00 us 57852.00 us 38
SYMLINK
0.00 519.11 us 76.00 us 952378.00 us
3149 LINK
0.00 196.97 us 50.00 us 736928.00 us 9055
SETATTR
0.00 164.34 us 18.00 us 736161.00 us 13460
INODELK
0.00 375.54 us 73.00 us 198362.00 us
6274 RENAME
0.01 20913.10 us 351.00 us 102696.00 us
348 MKDIR
0.01 151.39 us 17.00 us 782025.00 us
63598 STATFS
0.03 1103.67 us 34.00 us 618187.00 us
29597 OPEN
0.03 2833.17 us 43.00 us 1069257.00 us 11693
FTRUNCATE
0.04 2267.87 us 61.00 us 3746134.00 us
17859 UNLINK
0.04 13105.16 us 254.00 us 179505.00 us
3177 MKNOD
0.05 88496.76 us 21.00 us 1718559.00 us
613 FSYNC
0.58 73.42 us 6.00 us 1917794.00 us 7848483 FSTAT
0.71 17177.23 us 177.00 us 7077794.00 us
40554 CREATE
0.79 585.79 us 3.00 us 11107703.00 us
1322036 FLUSH
1.72 7459.40 us 9.00 us 2764285.00 us
228033 STAT
1.96 8350.73 us 19.00 us 2235725.00 us 231728
OPENDIR
2.60 115.35 us 12.00 us 4196355.00 us 22239110 WRITE
4.60 313.20 us 10.00 us 6211594.00 us 14494253 READ
5.98 307.95 us 13.00 us 9885480.00 us 19163193 GETXATTR
25.68 48514.34 us 17.00 us 4734636.00 us 522162
READDIRP
55.15 1075.93 us 26.00 us 4291535.00 us 50562855 LOOKUP

Duration: 9810315 seconds
Data Read: 708869551853133 bytes
Data Written: 335305857076797 bytes

Interval 0 Stats:
Block Size: 1b+ 2b+ 4b+
No. of Reads: 4 38 1482
No. of Writes: 643631512 59055444 235532859

Block Size: 8b+ 16b+ 32b+
No. of Reads: 1171 2138 4748
No. of Writes: 31816870 23602175 50161322

Block Size: 64b+ 128b+ 256b+
No. of Reads: 9461 65360 165954
No. of Writes: 711114605 11760241 4078907

Block Size: 512b+ 1024b+ 2048b+
No. of Reads: 94563 226053 258803
No. of Writes: 6366990 211643393 95831137

Block Size: 4096b+ 8192b+ 16384b+
No. of Reads: 383871 1032345 2244921
No. of Writes: 155833532 57850303 339892660

Block Size: 32768b+ 65536b+ 131072b+
No. of Reads: 7588068 22368398 5387488199
No. of Writes: 38588368 25195605 2463004132

Block Size: 262144b+
No. of Reads: 0
No. of Writes: 489
%-latency Avg-latency Min-Latency Max-Latency No. of
calls Fop
--------- ----------- ----------- -----------
------------ ----
0.00 0.00 us 0.00 us 0.00 us 4060397 FORGET
0.00 0.00 us 0.00 us 0.00 us 6244015 RELEASE
0.00 0.00 us 0.00 us 0.00 us 28716850 RELEASEDIR
0.00 96.42 us 61.00 us 148.00 us 40
READLINK
0.00 208.36 us 114.00 us 322.00 us
188 RMDIR
0.00 2231.61 us 57.00 us 716342.00 us 347
SETXATTR
0.00 20821.92 us 758.00 us 57852.00 us 38
SYMLINK
0.00 519.11 us 76.00 us 952378.00 us
3149 LINK
0.00 196.97 us 50.00 us 736928.00 us 9055
SETATTR
0.00 164.34 us 18.00 us 736161.00 us 13460
INODELK
0.00 375.54 us 73.00 us 198362.00 us
6274 RENAME
0.01 20913.10 us 351.00 us 102696.00 us
348 MKDIR
0.01 151.39 us 17.00 us 782025.00 us
63598 STATFS
0.03 1103.67 us 34.00 us 618187.00 us
29597 OPEN
0.03 2833.17 us 43.00 us 1069257.00 us 11693
FTRUNCATE
0.04 2267.87 us 61.00 us 3746134.00 us
17859 UNLINK
0.04 13105.16 us 254.00 us 179505.00 us
3177 MKNOD
0.05 88496.76 us 21.00 us 1718559.00 us
613 FSYNC
0.58 73.42 us 6.00 us 1917794.00 us 7848483 FSTAT
0.71 17177.23 us 177.00 us 7077794.00 us
40554 CREATE
0.79 585.79 us 3.00 us 11107703.00 us
1322036 FLUSH
1.72 7459.40 us 9.00 us 2764285.00 us
228033 STAT
1.96 8350.73 us 19.00 us 2235725.00 us 231728
OPENDIR
2.60 115.35 us 12.00 us 4196355.00 us 22239110 WRITE
4.60 313.20 us 10.00 us 6211594.00 us 14494253 READ
5.98 307.95 us 13.00 us 9885480.00 us 19163193 GETXATTR
25.68 48514.34 us 17.00 us 4734636.00 us 522162
READDIRP
55.15 1075.93 us 26.00 us 4291535.00 us 50562855 LOOKUP

Duration: 9810315 seconds
Data Read: 708869551853133 bytes
Data Written: 335305857076797 bytes
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single
point of failure. Unlike fuse mounts, if the gluster node
containing the gnfs server goes down, all mounts done using that
node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs
mounts, you can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than
gNFS servers?
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
Thanks,
Ravi
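For completeness, a hedged sketch of the profiling commands that guide describes (volume name from this thread; run the workload of interest between start and info):

gluster volume profile data-volume start
gluster volume profile data-volume info
gluster volume profile data-volume stop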
Post by Pat Haley
Hi,
We noticed a dramatic slowness when writing to a gluster disk
when compared to writing to an NFS disk. Specifically when using
* on NFS disk (/home): 9.5 Gb/s
* on gluster disk (/gdata): 508 Mb/s
The gluster disk is 2 bricks joined together, no replication or
* one server with 70 hard disks and a hardware RAID card.
* 4 disks in a RAID-6 group (the NFS disk)
* 32 disks in a RAID-6 group (the max allowed by the card, /mnt/brick1)
* 32 disks in another RAID-6 group (/mnt/brick2)
* 2 hot spare
Some additional information and more tests results (after
glusterfs 3.7.11 built on Apr 27 2016 14:09:22
CentOS release 6.8 (Final)
RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3
3108 [Invader] (rev 02)
*Create the file to /gdata (gluster)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.91876 s, *546 MB/s*
*Create the file to /home (ext4)*
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.686021 s, *1.5 GB/s - *3 times as fast*
Copy from /gdata to /gdata (gluster to gluster)
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 101.052 s, *10.4 MB/s* -
realllyyy slooowww
*Copy from /gdata to /gdata* *2nd time *(gluster to gluster)**
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 92.4904 s, *11.3 MB/s* -
realllyyy slooowww again
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.53263 s, *297 MB/s *30 times as fast
*Copy from /home to /home (ext4 to ext4)*
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 4.1737 s, *251 MB/s* - 30 times as fast
As a test, can we copy data directly to the xfs mountpoint
(/mnt/brick1) and bypass gluster?
Any help you could give us would be appreciated.
Thanks
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: ***@mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
Joe Julian
2017-05-16 16:08:12 UTC
Permalink
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and then
see if there is an improvement in speed. Fuse mounts are slower
than gnfs mounts but you get the benefit of avoiding a single
point of failure. Unlike fuse mounts, if the gluster node
containing the gnfs server goes down, all mounts done using that
node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs
mounts, you can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than
gNFS servers?
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
I have done actual testing. For directory ops, NFS is faster due to the
default cache settings in the kernel. For raw throughput, or ops on an
open file, fuse is faster.

I have yet to test this but I expect with the newer caching features in
3.8+, even directory op performance should be similar to nfs and more
accurate.
Post by Ravishankar N
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
to get the information.
Thanks,
Ravi
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pranith Kumar Karampuri
2017-05-17 09:02:04 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts but you get
the benefit of avoiding a single point of failure. Unlike fuse mounts, if
the gluster node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts, you
can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than gNFS
servers?
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
I have done actual testing. For directory ops, NFS is faster due to the
default cache settings in the kernel. For raw throughput, or ops on an open
file, fuse is faster.
I have yet to test this but I expect with the newer caching features in
3.8+, even directory op performance should be similar to nfs and more
accurate.
We are actually comparing fuse+gluster and kernel NFS on the same brick.
Did you get a chance to do this test at any point?
Post by Pranith Kumar Karampuri
You can follow https://gluster.readthedocs.io/en/latest/Administrator%
20Guide/Monitoring%20Workload/ to get the information.
Post by Ravishankar N
Thanks,
Ravi
--
Pranith
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
--
Pranith
Joe Julian
2017-05-17 16:24:38 UTC
Permalink
Post by Joe Julian
On Sat, Apr 8, 2017 at 10:28 AM, Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it
helps, you could try mounting it via gluster NFS (gnfs) and
then see if there is an improvement in speed. Fuse mounts are
slower than gnfs mounts but you get the benefit of avoiding a
single point of failure. Unlike fuse mounts, if the gluster
node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try
tweaking the write-behind xlator settings to see if it helps.
See the performance.write-behind and
performance.write-behind-window-size options in `gluster
volume set help`. Of course, even for gnfs mounts, you can
achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower
than gNFS servers?
Pat,
I see that I am late to the thread, but do you happen to
have "profile info" of the workload?
I have done actual testing. For directory ops, NFS is faster due
to the default cache settings in the kernel. For raw throughput,
or ops on an open file, fuse is faster.
I have yet to test this but I expect with the newer caching
features in 3.8+, even directory op performance should be similar
to nfs and more accurate.
We are actually comparing fuse+gluster and kernel NFS on the same
brick. Did you get a chance to do this test at any point?
No, that's not comparing like to like and I've rarely had a use case to
which a single-store NFS was the answer.
Post by Joe Julian
You can follow
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
<https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/>
to get the information.
Thanks,
Ravi
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users
Pranith Kumar Karampuri
2017-05-17 18:55:27 UTC
Permalink
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
Post by Ravishankar N
Hi Pat,
I'm assuming you are using gluster native (fuse mount). If it helps, you
could try mounting it via gluster NFS (gnfs) and then see if there is an
improvement in speed. Fuse mounts are slower than gnfs mounts but you get
the benefit of avoiding a single point of failure. Unlike fuse mounts, if
the gluster node containing the gnfs server goes down, all mounts done
using that node will fail). For fuse mounts, you could try tweaking the
write-behind xlator settings to see if it helps. See the
performance.write-behind and performance.write-behind-window-size
options in `gluster volume set help`. Of course, even for gnfs mounts, you
can achieve fail-over by using CTDB.
Ravi,
Do you have any data that suggests fuse mounts are slower than gNFS
servers?
Pat,
I see that I am late to the thread, but do you happen to have
"profile info" of the workload?
I have done actual testing. For directory ops, NFS is faster due to the
default cache settings in the kernel. For raw throughput, or ops on an open
file, fuse is faster.
I have yet to test this but I expect with the newer caching features in
3.8+, even directory op performance should be similar to nfs and more
accurate.
We are actually comparing fuse+gluster and kernel NFS on the same brick.
Did you get a chance to do this test at any point?
No, that's not comparing like to like and I've rarely had a use case to
which a single-store NFS was the answer.
Exactly. Why is it so bad compared to kNFS? Whether there is any scope for improvement is the question we are trying to find an answer to. If there is, everyone wins :-)

PS: I may not respond till tomorrow. Will go to sleep now.
Post by Pranith Kumar Karampuri
Post by Pranith Kumar Karampuri
You can follow https://gluster.readthedocs.io
/en/latest/Administrator%20Guide/Monitoring%20Workload/ to get the
information.
Post by Ravishankar N
Thanks,
Ravi
--
Pranith
_______________________________________________
Gluster-users mailing list
http://lists.gluster.org/mailman/listinfo/gluster-users