Discussion:
[Gluster-users] Disperse volumes on armhf
Fox
2018-08-03 04:21:30 UTC
Permalink
Just wondering if anyone else is running into the same behavior with
disperse volumes described below and what I might be able to do about it.

I am using Ubuntu 18.04 LTS on Odroid HC-2 hardware (armhf) and have
installed Gluster 4.1.2 via the PPA. I have 12 member nodes, each with a
single brick. I can successfully create a working volume via the command:

gluster volume create testvol1 disperse 12 redundancy 4 \
  gluster01:/exports/sda/brick1/testvol1 \
  gluster02:/exports/sda/brick1/testvol1 \
  gluster03:/exports/sda/brick1/testvol1 \
  gluster04:/exports/sda/brick1/testvol1 \
  gluster05:/exports/sda/brick1/testvol1 \
  gluster06:/exports/sda/brick1/testvol1 \
  gluster07:/exports/sda/brick1/testvol1 \
  gluster08:/exports/sda/brick1/testvol1 \
  gluster09:/exports/sda/brick1/testvol1 \
  gluster10:/exports/sda/brick1/testvol1 \
  gluster11:/exports/sda/brick1/testvol1 \
  gluster12:/exports/sda/brick1/testvol1

And start the volume:
gluster volume start testvol1
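
The volume can then be inspected with, for example:

gluster volume info testvol1
gluster volume status testvol1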

Mounting the volume on an x86-64 system, it performs as expected.

Mounting the same volume on an armhf system (such as one of the cluster
members), I can create directories, but when I try to create a file I get an
error and the file system unmounts/crashes:
***@gluster01:~# mount -t glusterfs gluster01:/testvol1 /mnt
***@gluster01:~# cd /mnt
***@gluster01:/mnt# ls
***@gluster01:/mnt# mkdir test
***@gluster01:/mnt# cd test
***@gluster01:/mnt/test# cp /root/notes.txt ./
cp: failed to close './notes.txt': Software caused connection abort
***@gluster01:/mnt/test# ls
ls: cannot open directory '.': Transport endpoint is not connected
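
After the crash, the FUSE client log for this mount (by default
/var/log/glusterfs/mnt.log for a mount on /mnt) and the core dump location
can be checked with, for example:

tail -n 50 /var/log/glusterfs/mnt.log
cat /proc/sys/kernel/core_pattern    # where a core dump, if any, is written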

I get many of these in the glusterfsd.log:
The message "W [MSGID: 101088] [common-utils.c:4316:gf_backtrace_save]
0-management: Failed to save the backtrace." repeated 100 times between
[2018-08-03 04:06:39.904166] and [2018-08-03 04:06:57.521895]


Furthermore, if a cluster member drops out (reboots, loses connection, etc.)
and needs healing, the self-heal daemon logs messages similar to the one above
and cannot heal: there is no disk activity (verified via iotop) but very high
CPU usage, and the volume heal info command indicates the volume still needs
healing.
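
The heal state can be queried and a heal triggered manually with, for example:

gluster volume heal testvol1 info
gluster volume heal testvol1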


I tested all of the above in virtual environments using x86-64 VMs and
self-heal worked as expected.

Again this only happens when using disperse volumes. Should I be filing a
bug report instead?
Ashish Pandey
2018-08-03 05:57:36 UTC
Permalink
Yes, you should file a bug to track this issue and to share information.
Also, I would like to have the logs under /var/log/glusterfs, especially the mount log (named mnt.log or similar).

Following are the points I would like to bring to your notice:

1 - Are you sure that all the bricks are UP?
2 - Are there any connection issues?
3 - It is possible that a bug caused a crash, so please check for a core dump created at the point where you did the mount and saw the ENOTCONN error.
4 - I am not very familiar with armhf and have not run glusterfs on this hardware, so we need to see whether there is anything in the code that prevents
glusterfs from running on this architecture and setup.
5 - Please provide the output of gluster v info and gluster v status for the volume in the BZ (a sketch of commands that collects this follows below).
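
A minimal sketch that collects most of this (assuming the volume is named
testvol1 and default Ubuntu locations):

gluster volume info testvol1
gluster volume status testvol1
gluster peer status                      # confirm all peers are connected
cat /proc/sys/kernel/core_pattern        # where a core dump would be written
ls -l /core* /var/crash 2>/dev/null      # common core locations on Ubuntu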

---
Ashish

Milind Changire
2018-08-03 07:33:28 UTC
Permalink
What is the endianness of the armhf CPU ?
Are you running a 32bit or 64bit Operating System ?
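
For example, both can be checked on one of the nodes with:

lscpu | grep -E 'Architecture|Byte Order'
getconf LONG_BIT    # prints 32 or 64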
--
Milind
Fox
2018-08-04 01:18:17 UTC
Permalink
Replying to the last batch of questions I've received...

To reiterate, I am only having problems writing files to a disperse volume
when it is mounted on an armhf system. Mounting the same volume on an x86-64
system works fine.
Disperse volumes running on ARM also cannot heal.

Replica volumes mount and heal just fine.


All bricks are up and running. I have verified connectivity and that the MTU
is correct and identical across nodes.
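
A quick way to double-check this, for example:

gluster peer status
ip -o link show | grep -o 'mtu [0-9]*'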

armhf is 32-bit:
# uname -a
Linux gluster01 4.14.55-146 #1 SMP PREEMPT Wed Jul 11 22:31:01 -03 2018
armv7l armv7l armv7l GNU/Linux
# file /bin/bash
/bin/bash: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (SYSV),
dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux
3.2.0, BuildID[sha1]=e0a53f804173b0cd9845bb8a76fee1a1e98a9759, stripped
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
# free
              total        used        free      shared  buff/cache   available
Mem:        2042428       83540     1671004        6052      287884     1895684
Swap:             0           0           0


8 cores total: 4 running at 2 GHz and 4 running at 1.4 GHz.
processor : 0
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 24.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva
idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3

processor : 4
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 72.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva
idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3



There IS a 98 MB /core file from the fuse mount, so that's cool.
# file /core
/core: ELF 32-bit LSB core file ARM, version 1 (SYSV), SVR4-style, from
'/usr/sbin/glusterfs --process-name fuse --volfile-server=gluster01
--volfile-id', real uid: 0, effective uid: 0, real gid: 0, effective gid:
0, execfn: '/usr/sbin/glusterfs', platform: 'v7l'

I will try and get a bug report with logs filed over the weekend.

This is just an experimental home cluster; I don't have anything on it yet.
It's possible I could grant someone SSH access to the cluster if it helps
further the Gluster project, but the results should be reproducible on
something like a Raspberry Pi. I was hoping to run a disperse volume on it
eventually, otherwise I would have never found this issue.

Thank you for the troubleshooting ideas.


-Fox
Xavi Hernandez
2018-08-06 07:23:23 UTC
Permalink
Hi,
Post by Fox
[...]
There IS a 98 MB /core file from the fuse mount, so that's cool.
# file /core
/core: ELF 32-bit LSB core file ARM, version 1 (SYSV), SVR4-style, from
'/usr/sbin/glusterfs --process-name fuse --volfile-server=gluster01
--volfile-id', real uid: 0, effective uid: 0, real gid: 0, effective gid:
0, execfn: '/usr/sbin/glusterfs', platform: 'v7l'
One possible cause is some 64/32-bit inconsistency. If you have the debug
symbols installed and can provide a backtrace from the core dump, it would
help to identify the problem.
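
For example, something along these lines should produce a usable backtrace
(assuming the matching debug symbols are installed, e.g. a glusterfs-dbg
package on Ubuntu; the exact package name may differ):

gdb /usr/sbin/glusterfs /core
(gdb) set pagination off
(gdb) thread apply all bt full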

Xavi