Discussion:
[Gluster-users] op-version for reset-brick (Was: Re: [ovirt-users] Upgrading HC from 4.0 to 4.1)
Sahina Bose
2017-07-05 15:02:04 UTC
...
gluster volume reset-brick export ovirt01.localdomain.local:/gluster/brick3/export
start
gluster volume reset-brick export ovirt01.localdomain.local:/gluster/brick3/export
gl01.localdomain.local:/gluster/brick3/export commit force
Correct?
Yes, correct. gl01.localdomain.local should resolve correctly on all 3
nodes.
ovirt01.localdomain.local:/gluster/brick3/export start
volume reset-brick: failed: Cannot execute command. The cluster is
operating at version 30712. reset-brick command reset-brick start is
unavailable in this version.
It seems somehow related to this upgrade note for the commercial solution
Red Hat Gluster Storage:
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Installation_Guide/chap-Upgrading_Red_Hat_Storage.html
gluster volume set all cluster.op-version XXXXX
with XXXXX > 30712
It seems that the latest version of the commercial Red Hat Gluster Storage is
3.1 and its op-version is indeed 30712.
So the question is which particular op-version I have to set, and whether the
command can be run online without causing disruption.
It should have worked with the glusterfs 3.10 version from the CentOS repo.
Adding gluster-users for help on the op-version
Thanks,
Gianluca
Atin Mukherjee
2017-07-05 15:14:29 UTC
Post by Sahina Bose
...
gluster volume reset-brick export ovirt01.localdomain.local:/gluster/brick3/export
start
gluster volume reset-brick export ovirt01.localdomain.local:/gluster/brick3/export
gl01.localdomain.local:/gluster/brick3/export commit force
Correct?
Yes, correct. gl01.localdomain.local should resolve correctly on all 3
nodes.
ovirt01.localdomain.local:/gluster/brick3/export start
volume reset-brick: failed: Cannot execute command. The cluster is
operating at version 30712. reset-brick command reset-brick start is
unavailable in this version.
It seems somehow related to this upgrade note for the commercial solution
Red Hat Gluster Storage:
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Installation_Guide/chap-Upgrading_Red_Hat_Storage.html
gluster volume set all cluster.op-version XXXXX
with XXXXX > 30712
It seems that the latest version of the commercial Red Hat Gluster Storage is
3.1 and its op-version is indeed 30712.
So the question is which particular op-version I have to set, and whether the
command can be run online without causing disruption.
It should have worked with the glusterfs 3.10 version from the CentOS repo.
Adding gluster-users for help on the op-version
This definitely means your cluster is operating at an op-version < 3.9.0:

if (conf->op_version < GD_OP_VERSION_3_9_0 &&
    strcmp (cli_op, "GF_REPLACE_OP_COMMIT_FORCE")) {
        snprintf (msg, sizeof (msg), "Cannot execute command. The "
                  "cluster is operating at version %d. reset-brick "
                  "command %s is unavailable in this version.",
                  conf->op_version, gd_rb_op_to_str (cli_op));
        ret = -1;
        goto out;
}

What version of the gluster bits are you running across the gluster cluster?
Please note that cluster.op-version is not exactly the same as the rpm version,
and with every upgrade it's recommended to bump up the op-version.
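For reference, a minimal sketch of that sequence, run from any one node
(assuming all servers and clients already run the newer bits; op-versions
follow the release numbering, e.g. 30712 ~ 3.7.12, 30900 ~ 3.9.0, 31000 ~
3.10.0, and the check quoted above needs at least 30900 for reset-brick start):

# current cluster-wide operating version
gluster volume get all cluster.op-version

# highest op-version the installed bits support (available since glusterfs 3.10)
gluster volume get all cluster.max-op-version

# bump the cluster to that value; it is a cluster-wide setting and can be set online
gluster volume set all cluster.op-version <max-op-version reported above>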
Post by Sahina Bose
Thanks,
Gianluca
Gianluca Cecchi
2017-07-05 15:13:02 UTC
Post by Sahina Bose
...
gluster volume reset-brick export ovirt01.localdomain.local:/gluster/brick3/export
start
gluster volume reset-brick export ovirt01.localdomain.local:/gluster/brick3/export
gl01.localdomain.local:/gluster/brick3/export commit force
Correct?
Yes, correct. gl01.localdomain.local should resolve correctly on all 3
nodes.
ovirt01.localdomain.local:/gluster/brick3/export start
volume reset-brick: failed: Cannot execute command. The cluster is
operating at version 30712. reset-brick command reset-brick start is
unavailable in this version.
It seems somehow related to this upgrade note for the commercial solution
Red Hat Gluster Storage:
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Installation_Guide/chap-Upgrading_Red_Hat_Storage.html
gluster volume set all cluster.op-version XXXXX
with XXXXX > 30712
It seems that the latest version of the commercial Red Hat Gluster Storage is
3.1 and its op-version is indeed 30712.
So the question is which particular op-version I have to set, and whether the
command can be run online without causing disruption.
It should have worked with the glusterfs 3.10 version from the CentOS repo.
Adding gluster-users for help on the op-version
Thanks,
Gianluca
It seems the op-version is not updated automatically by default, so that it can
handle mixed versions while you update the nodes one by one...

I followed what is described here:
https://gluster.readthedocs.io/en/latest/Upgrade-Guide/op_version/


- Get current version:

[***@ovirt01 ~]# gluster volume get all cluster.op-version
Option                                   Value
------                                   -----
cluster.op-version                       30712
[***@ovirt01 ~]#


- Get maximum version I can set for current setup:

[***@ovirt01 ~]# gluster volume get all cluster.max-op-version
Option                                   Value
------                                   -----
cluster.max-op-version                   31000
[***@ovirt01 ~]#


- Get op version information for all the connected clients:

[***@ovirt01 ~]# gluster volume status all clients | grep ":49" | awk
'{print $4}' | sort | uniq -c
72 31000
[***@ovirt01 ~]#

--> ok


- Update op-version

[***@ovirt01 ~]# gluster volume set all cluster.op-version 31000
volume set: success
[***@ovirt01 ~]#


- Verify:
[***@ovirt01 ~]# gluster volume get all cluster.op-version
Option                                   Value
------                                   -----
cluster.op-version                       31000
[***@ovirt01 ~]#

--> ok

[***@ovirt01 ~]# gluster volume reset-brick export
ovirt01.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful

[***@ovirt01 ~]# gluster volume reset-brick export
ovirt01.localdomain.local:/gluster/brick3/export
gl01.localdomain.local:/gluster/brick3/export commit force
volume reset-brick: failed: Commit failed on ovirt02.localdomain.local.
Please check log file for details.
Commit failed on ovirt03.localdomain.local. Please check log file for
details.
[***@ovirt01 ~]#

[***@ovirt01 bricks]# gluster volume info export

Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: gl01.localdomain.local:/gluster/brick3/export
Brick2: ovirt02.localdomain.local:/gluster/brick3/export
Brick3: ovirt03.localdomain.local:/gluster/brick3/export (arbiter)
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
[***@ovirt01 bricks]# gluster volume reset-brick export
ovirt02.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
[***@ovirt01 bricks]# gluster volume reset-brick export
ovirt02.localdomain.local:/gluster/brick3/export
gl02.localdomain.local:/gluster/brick3/export commit force
volume reset-brick: failed: Commit failed on localhost. Please check log
file for details.
[***@ovirt01 bricks]#

I proceed anyway (I have actually nothing on the export volume...)

[***@ovirt01 bricks]# gluster volume reset-brick export
ovirt02.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
[***@ovirt01 bricks]# gluster volume reset-brick export
ovirt02.localdomain.local:/gluster/brick3/export
gl02.localdomain.local:/gluster/brick3/export commit force
volume reset-brick: failed: Commit failed on localhost. Please check log
file for details.
[***@ovirt01 bricks]#

Again error


[***@ovirt01 bricks]# gluster volume info export

Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 2
Transport-type: tcp
Bricks:
Brick1: gl01.localdomain.local:/gluster/brick3/export
Brick2: ovirt03.localdomain.local:/gluster/brick3/export
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
[***@ovirt01 bricks]#


The last one:

[***@ovirt01 bricks]# gluster volume reset-brick export
ovirt03.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
[***@ovirt01 bricks]# gluster volume reset-brick export
ovirt03.localdomain.local:/gluster/brick3/export
gl03.localdomain.local:/gluster/brick3/export commit force
volume reset-brick: failed: Commit failed on localhost. Please check log
file for details.
[***@ovirt01 bricks]#

again error


[***@ovirt01 bricks]# gluster volume info export

Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 1
Transport-type: tcp
Bricks:
Brick1: gl01.localdomain.local:/gluster/brick3/export
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
[***@ovirt01 bricks]#

See here for the gluster log in gzip format:
https://drive.google.com/file/d/0BwoPbcrMv8mvQmlYZjAySTZKTzQ/view?usp=sharing

The first command was executed at 14:57 and the other two at 15:04.

This is what oVirt sees right now for the volume:
https://drive.google.com/file/d/0BwoPbcrMv8mvNFAyd043TnNwSEU/view?usp=sharing

(After the first command I saw 2 of 3 up)

Gianluca
Atin Mukherjee
2017-07-05 15:22:23 UTC
And what does glusterd log indicate for these failures?
Post by Gianluca Cecchi
On Wed, Jul 5, 2017 at 8:16 PM, Gianluca Cecchi <
...
gluster volume reset-brick export ovirt01.localdomain.local:/gluster/brick3/export
start
gluster volume reset-brick export ovirt01.localdomain.local:/gluster/brick3/export
gl01.localdomain.local:/gluster/brick3/export commit force
Correct?
Yes, correct. gl01.localdomain.local should resolve correctly on all 3
nodes.
ovirt01.localdomain.local:/gluster/brick3/export start
volume reset-brick: failed: Cannot execute command. The cluster is
operating at version 30712. reset-brick command reset-brick start is
unavailable in this version.
It seems somehow related to this upgrade note for the commercial solution
Red Hat Gluster Storage:
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Installation_Guide/chap-Upgrading_Red_Hat_Storage.html
gluster volume set all cluster.op-version XXXXX
with XXXXX > 30712
It seems that the latest version of the commercial Red Hat Gluster Storage is
3.1 and its op-version is indeed 30712.
So the question is which particular op-version I have to set, and whether the
command can be run online without causing disruption.
It should have worked with the glusterfs 3.10 version from the CentOS repo.
Adding gluster-users for help on the op-version
Thanks,
Gianluca
It seems op-version is not updated automatically by default, so that it
can manage mixed versions while you update one by one...
https://gluster.readthedocs.io/en/latest/Upgrade-Guide/op_version/
Option Value
------ -----
cluster.op-version 30712
Option Value
------ -----
cluster.max-op-version 31000
'{print $4}' | sort | uniq -c
72 31000
--> ok
- Update op-version
volume set: success
Value
------ -----
cluster.op-version 31000
--> ok
ovirt01.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
ovirt01.localdomain.local:/gluster/brick3/export gl01.localdomain.local:/gluster/brick3/export
commit force
volume reset-brick: failed: Commit failed on ovirt02.localdomain.local.
Please check log file for details.
Commit failed on ovirt03.localdomain.local. Please check log file for
details.
Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Brick1: gl01.localdomain.local:/gluster/brick3/export
Brick2: ovirt02.localdomain.local:/gluster/brick3/export
Brick3: ovirt03.localdomain.local:/gluster/brick3/export (arbiter)
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
ovirt02.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
ovirt02.localdomain.local:/gluster/brick3/export gl02.localdomain.local:/gluster/brick3/export
commit force
volume reset-brick: failed: Commit failed on localhost. Please check log
file for details.
I proceed (I have actually nothing on export volume...)
ovirt02.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
ovirt02.localdomain.local:/gluster/brick3/export gl02.localdomain.local:/gluster/brick3/export
commit force
volume reset-brick: failed: Commit failed on localhost. Please check log
file for details.
Again error
Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 2
Transport-type: tcp
Brick1: gl01.localdomain.local:/gluster/brick3/export
Brick2: ovirt03.localdomain.local:/gluster/brick3/export
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
The last
ovirt03.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
ovirt03.localdomain.local:/gluster/brick3/export gl03.localdomain.local:/gluster/brick3/export
commit force
volume reset-brick: failed: Commit failed on localhost. Please check log
file for details.
again error
Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 1
Transport-type: tcp
Brick1: gl01.localdomain.local:/gluster/brick3/export
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
See here for gluster log in gzip format....
https://drive.google.com/file/d/0BwoPbcrMv8mvQmlYZjAySTZKTzQ/view?usp=sharing
The first command executed at 14:57 and the other two at 15:04
This is what seen by oVirt right now for the volume
https://drive.google.com/file/d/0BwoPbcrMv8mvNFAyd043TnNwSEU/view?usp=sharing
(After the first command I saw 2 of 3 up)
Gianluca
Gianluca Cecchi
2017-07-05 15:42:19 UTC
Post by Atin Mukherjee
And what does glusterd log indicate for these failures?
See here in gzip format

https://drive.google.com/file/d/0BwoPbcrMv8mvYmlRLUgyV0pFN0k/view?usp=sharing


It seems that on each host the peer files have been updated with a new
entry "hostname2":

[***@ovirt01 ~]# cat /var/lib/glusterd/peers/*
uuid=b89311fe-257f-4e44-8e15-9bff6245d689
state=3
hostname1=ovirt02.localdomain.local
hostname2=10.10.2.103
uuid=ec81a04c-a19c-4d31-9d82-7543cefe79f3
state=3
hostname1=ovirt03.localdomain.local
hostname2=10.10.2.104
[***@ovirt01 ~]#

[***@ovirt02 ~]# cat /var/lib/glusterd/peers/*
uuid=e9717281-a356-42aa-a579-a4647a29a0bc
state=3
hostname1=ovirt01.localdomain.local
hostname2=10.10.2.102
uuid=ec81a04c-a19c-4d31-9d82-7543cefe79f3
state=3
hostname1=ovirt03.localdomain.local
hostname2=10.10.2.104
[***@ovirt02 ~]#

[***@ovirt03 ~]# cat /var/lib/glusterd/peers/*
uuid=b89311fe-257f-4e44-8e15-9bff6245d689
state=3
hostname1=ovirt02.localdomain.local
hostname2=10.10.2.103
uuid=e9717281-a356-42aa-a579-a4647a29a0bc
state=3
hostname1=ovirt01.localdomain.local
hostname2=10.10.2.102
[***@ovirt03 ~]#


But not the gluster volume info on the second and third nodes, which have lost
the ovirt01/gl01 brick information...

Eg on ovirt02


[***@ovirt02 peers]# gluster volume info export

Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 2
Transport-type: tcp
Bricks:
Brick1: ovirt02.localdomain.local:/gluster/brick3/export
Brick2: ovirt03.localdomain.local:/gluster/brick3/export
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
[***@ovirt02 peers]#

And on ovirt03

[***@ovirt03 ~]# gluster volume info export

Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 2
Transport-type: tcp
Bricks:
Brick1: ovirt02.localdomain.local:/gluster/brick3/export
Brick2: ovirt03.localdomain.local:/gluster/brick3/export
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
[***@ovirt03 ~]#

While on ovirt01 it seems isolated...

[***@ovirt01 ~]# gluster volume info export

Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 1
Transport-type: tcp
Bricks:
Brick1: gl01.localdomain.local:/gluster/brick3/export
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
[***@ovirt01 ~]#
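As a quick cross-check that the peers really disagree about the volume layout,
something like the following could be run (a sketch; short hostnames, ssh
access between the nodes and the stock /var/lib/glusterd layout are assumed):

# brick list as each glusterd sees it
for h in ovirt01 ovirt02 ovirt03; do
    echo "== $h"
    ssh $h "gluster volume info export | grep -E '^(Number of Bricks|Brick[0-9])'"
done

# checksum of the stored volume definition; it should be identical on all peers
for h in ovirt01 ovirt02 ovirt03; do
    echo "== $h"
    ssh $h "cat /var/lib/glusterd/vols/export/cksum"
done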
Atin Mukherjee
2017-07-05 16:39:01 UTC
OK, so the log just hints at the following:

[2017-07-05 15:04:07.178204] E [MSGID: 106123]
[glusterd-mgmt.c:1532:glusterd_mgmt_v3_commit] 0-management: Commit failed
for operation Reset Brick on local node
[2017-07-05 15:04:07.178214] E [MSGID: 106123]
[glusterd-replace-brick.c:649:glusterd_mgmt_v3_initiate_replace_brick_cmd_phases]
0-management: Commit Op Failed

Going through the code, glusterd_op_reset_brick () failed, resulting in these
logs. I don't see any error logs generated from glusterd_op_reset_brick (),
which makes me think we failed at a place where the failure is only logged at
debug level. Would you be able to restart the glusterd service in debug log
mode, rerun this test and share the log?
Post by Gianluca Cecchi
Post by Atin Mukherjee
And what does glusterd log indicate for these failures?
See here in gzip format
https://drive.google.com/file/d/0BwoPbcrMv8mvYmlRLUgyV0pFN0k/
view?usp=sharing
It seems that on each host the peer files have been updated with a new
uuid=b89311fe-257f-4e44-8e15-9bff6245d689
state=3
hostname1=ovirt02.localdomain.local
hostname2=10.10.2.103
uuid=ec81a04c-a19c-4d31-9d82-7543cefe79f3
state=3
hostname1=ovirt03.localdomain.local
hostname2=10.10.2.104
uuid=e9717281-a356-42aa-a579-a4647a29a0bc
state=3
hostname1=ovirt01.localdomain.local
hostname2=10.10.2.102
uuid=ec81a04c-a19c-4d31-9d82-7543cefe79f3
state=3
hostname1=ovirt03.localdomain.local
hostname2=10.10.2.104
uuid=b89311fe-257f-4e44-8e15-9bff6245d689
state=3
hostname1=ovirt02.localdomain.local
hostname2=10.10.2.103
uuid=e9717281-a356-42aa-a579-a4647a29a0bc
state=3
hostname1=ovirt01.localdomain.local
hostname2=10.10.2.102
But not the gluster info on the second and third node that have lost the
ovirt01/gl01 host brick information...
Eg on ovirt02
Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 2
Transport-type: tcp
Brick1: ovirt02.localdomain.local:/gluster/brick3/export
Brick2: ovirt03.localdomain.local:/gluster/brick3/export
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
And on ovirt03
Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 2
Transport-type: tcp
Brick1: ovirt02.localdomain.local:/gluster/brick3/export
Brick2: ovirt03.localdomain.local:/gluster/brick3/export
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
While on ovirt01 it seems isolated...
Volume Name: export
Type: Replicate
Volume ID: b00e5839-becb-47e7-844f-6ce6ce1b7153
Status: Started
Snapshot Count: 0
Number of Bricks: 0 x (2 + 1) = 1
Transport-type: tcp
Brick1: gl01.localdomain.local:/gluster/brick3/export
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
Gianluca Cecchi
2017-07-05 22:17:03 UTC
[2017-07-05 15:04:07.178204] E [MSGID: 106123] [glusterd-mgmt.c:1532:glusterd_mgmt_v3_commit]
0-management: Commit failed for operation Reset Brick on local node
[2017-07-05 15:04:07.178214] E [MSGID: 106123]
[glusterd-replace-brick.c:649:glusterd_mgmt_v3_initiate_replace_brick_cmd_phases]
0-management: Commit Op Failed
Going through the code, glusterd_op_reset_brick () failed, resulting in these
logs. I don't see any error logs generated from glusterd_op_reset_brick (),
which makes me think we failed at a place where the failure is only logged at
debug level. Would you be able to restart the glusterd service in debug log
mode, rerun this test and share the log?
What's the best way to set glusterd in debug mode?
Can I set it on this volume, and work on it even if it is now compromised?

I ask because I have tried this:

[***@ovirt01 ~]# gluster volume get export diagnostics.brick-log-level
Option                                   Value
------                                   -----
diagnostics.brick-log-level              INFO


[***@ovirt01 ~]# gluster volume set export diagnostics.brick-log-level DEBUG
volume set: failed: Error, Validation Failed
[***@ovirt01 ~]#

While on another volume that is in a good state, I can run:

[***@ovirt01 ~]# gluster volume set iso diagnostics.brick-log-level DEBUG
volume set: success
[***@ovirt01 ~]#

[***@ovirt01 ~]# gluster volume get iso diagnostics.brick-log-level
Option                                   Value
------                                   -----
diagnostics.brick-log-level              DEBUG

[***@ovirt01 ~]# gluster volume set iso diagnostics.brick-log-level INFO
volume set: success
[***@ovirt01 ~]#

[***@ovirt01 ~]# gluster volume get iso diagnostics.brick-log-level
Option                                   Value
------                                   -----
diagnostics.brick-log-level              INFO
[***@ovirt01 ~]#

Do you mean to run the reset-brick command for another volume or for the
same? Can I run it against this "now broken" volume?

Or perhaps I can modify /usr/lib/systemd/system/glusterd.service and change, in
the [Service] section,

from
Environment="LOG_LEVEL=INFO"

to
Environment="LOG_LEVEL=DEBUG"

and then
systemctl daemon-reload
systemctl restart glusterd

I think it would be better to keep gluster in debug mode for as little time as
possible, as there are other volumes active right now, and I want to keep the
log file system from filling up.
It would be best to put only some components in debug mode if possible, as in
the example commands above.
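Alternatively, a systemd drop-in would keep the unit file untouched and be easy
to revert once the failure has been reproduced. A minimal sketch (assuming
glusterd is managed by systemd as above):

# override only the LOG_LEVEL variable; the later assignment in the drop-in
# takes precedence over the INFO value set in the unit file
mkdir -p /etc/systemd/system/glusterd.service.d
cat > /etc/systemd/system/glusterd.service.d/debug-log.conf <<'EOF'
[Service]
Environment="LOG_LEVEL=DEBUG"
EOF
systemctl daemon-reload
systemctl restart glusterd

# ...reproduce the failing reset-brick command, then revert:
rm /etc/systemd/system/glusterd.service.d/debug-log.conf
systemctl daemon-reload
systemctl restart glusterd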

Let me know,
thanks
Atin Mukherjee
2017-07-06 04:55:50 UTC
Post by Gianluca Cecchi
Post by Atin Mukherjee
[2017-07-05 15:04:07.178204] E [MSGID: 106123]
[glusterd-mgmt.c:1532:glusterd_mgmt_v3_commit] 0-management: Commit
failed for operation Reset Brick on local node
[2017-07-05 15:04:07.178214] E [MSGID: 106123]
[glusterd-replace-brick.c:649:glusterd_mgmt_v3_initiate_replace_brick_cmd_phases]
0-management: Commit Op Failed
Going through the code, glusterd_op_reset_brick () failed, resulting in these
logs. I don't see any error logs generated from glusterd_op_reset_brick (),
which makes me think we failed at a place where the failure is only logged at
debug level. Would you be able to restart the glusterd service in debug log
mode, rerun this test and share the log?
Do you mean to run the reset-brick command for another volume or for the
same? Can I run it against this "now broken" volume?
Or perhaps can I modify /usr/lib/systemd/system/glusterd.service and
change in [service] section
from
Environment="LOG_LEVEL=INFO"
to
Environment="LOG_LEVEL=DEBUG"
and then
systemctl daemon-reload
systemctl restart glusterd
Yes, that's how you can run glusterd in debug log mode.
Post by Gianluca Cecchi
I think it would be better to keep gluster in debug mode for as little time as
possible, as there are other volumes active right now, and I want to keep the
log file system from filling up.
It would be best to put only some components in debug mode if possible, as in
the example commands above.
You can switch back to info mode the moment this is hit one more time with
the debug log enabled. What I'd need here is the glusterd log (with debug
mode) to figure out the exact cause of the failure.
Post by Gianluca Cecchi
Let me know,
thanks
Atin Mukherjee
2017-07-06 12:16:54 UTC
Eventually I can destroy and recreate this "export" volume again with the
old names (ovirt0N.localdomain.local) if you give me the sequence of
commands, then enable debug and retry the reset-brick command
Gianluca
So it seems I was able to destroy and re-create it.
Now I see that the volume creation uses the new IPs by default, so I reversed
the hostname roles in the commands after putting glusterd in debug mode on the
host where I execute the reset-brick command (do I have to set debug on the
other nodes too?)
You have to set the log level to debug for the glusterd instance where the
commit fails, and share the glusterd log of that particular node.
gl01.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
gl01.localdomain.local:/gluster/brick3/export ovirt01.localdomain.local:/gluster/brick3/export
commit force
volume reset-brick: failed: Commit failed on ovirt02.localdomain.local.
Please check log file for details.
Commit failed on ovirt03.localdomain.local. Please check log file for
details.
https://drive.google.com/file/d/0BwoPbcrMv8mvYmlRLUgyV0pFN0k/view?usp=sharing
Time of the reset-brick operation in logfile is 2017-07-06 11:42
(BTW: can I have the log timestamps not in UTC, given that I'm using CEST on
my system?)
I see a difference, because the brick doesn't seem isolated as before...
Volume Name: export
Type: Replicate
Volume ID: e278a830-beed-4255-b9ca-587a630cbdbf
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Brick1: ovirt01.localdomain.local:/gluster/brick3/export
Brick2: 10.10.2.103:/gluster/brick3/export
Brick3: 10.10.2.104:/gluster/brick3/export (arbiter)
Volume Name: export
Type: Replicate
Volume ID: e278a830-beed-4255-b9ca-587a630cbdbf
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Brick1: ovirt01.localdomain.local:/gluster/brick3/export
Brick2: 10.10.2.103:/gluster/brick3/export
Brick3: 10.10.2.104:/gluster/brick3/export (arbiter)
And also in oVirt I see all 3 bricks online....
Gianluca
Gianluca Cecchi
2017-07-06 13:22:22 UTC
Post by Atin Mukherjee
On Thu, Jul 6, 2017 at 8:38 AM, Gianluca Cecchi <
Eventually I can destroy and recreate this "export" volume again with
the old names (ovirt0N.localdomain.local) if you give me the sequence of
commands, then enable debug and retry the reset-brick command
Gianluca
So it seems I was able to destroy and re-create.
Now I see that the volume creation uses by default the new ip, so I
reverted the hostnames roles in the commands after putting glusterd in
debug mode on the host where I execute the reset-brick command (do I have
to set debug for the other nodes too?)
You have to set the log level to debug for glusterd instance where the
commit fails and share the glusterd log of that particular node.
Ok, done.

Command executed on ovirt01 with timestamp "2017-07-06 13:04:12" in
glusterd log files

[***@ovirt01 export]# gluster volume reset-brick export
gl01.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful

[***@ovirt01 export]# gluster volume reset-brick export
gl01.localdomain.local:/gluster/brick3/export
ovirt01.localdomain.local:/gluster/brick3/export commit force
volume reset-brick: failed: Commit failed on ovirt02.localdomain.local.
Please check log file for details.
Commit failed on ovirt03.localdomain.local. Please check log file for
details.
[***@ovirt01 export]#

See glusterd log files for the 3 nodes in debug mode here:
ovirt01:
https://drive.google.com/file/d/0BwoPbcrMv8mvY1RTTGp3RUhScm8/view?usp=sharing
ovirt02:
https://drive.google.com/file/d/0BwoPbcrMv8mvSVpJUHNhMzhMSU0/view?usp=sharing
ovirt03:
https://drive.google.com/file/d/0BwoPbcrMv8mvT2xiWEdQVmJNb0U/view?usp=sharing

HIH debugging
Gianluca
Atin Mukherjee
2017-07-07 09:57:03 UTC
You'd need to allow me some more time to dig into the logs. I'll try to get
back on this by Monday.
Post by Gianluca Cecchi
On Thu, Jul 6, 2017 at 5:26 PM, Gianluca Cecchi <
On Thu, Jul 6, 2017 at 8:38 AM, Gianluca Cecchi <
Eventually I can destroy and recreate this "export" volume again with
the old names (ovirt0N.localdomain.local) if you give me the sequence of
commands, then enable debug and retry the reset-brick command
Gianluca
So it seems I was able to destroy and re-create.
Now I see that the volume creation uses by default the new ip, so I
reverted the hostnames roles in the commands after putting glusterd in
debug mode on the host where I execute the reset-brick command (do I have
to set debug for the other nodes too?)
You have to set the log level to debug for glusterd instance where the
commit fails and share the glusterd log of that particular node.
Ok, done.
Command executed on ovirt01 with timestamp "2017-07-06 13:04:12" in
glusterd log files
gl01.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
gl01.localdomain.local:/gluster/brick3/export
ovirt01.localdomain.local:/gluster/brick3/export commit force
volume reset-brick: failed: Commit failed on ovirt02.localdomain.local.
Please check log file for details.
Commit failed on ovirt03.localdomain.local. Please check log file for
details.
ovirt01: https://drive.google.com/file/d/0BwoPbcrMv8mvY1RTTGp3RUhScm8/view?usp=sharing
ovirt02: https://drive.google.com/file/d/0BwoPbcrMv8mvSVpJUHNhMzhMSU0/view?usp=sharing
ovirt03: https://drive.google.com/file/d/0BwoPbcrMv8mvT2xiWEdQVmJNb0U/view?usp=sharing
HIH debugging
Gianluca
Hi Atin,
did you have time to see the logs?
Comparing the debug-enabled messages with the previous ones, I see these added
lines on the nodes where the commit failed, after running the commands
gluster volume reset-brick export gl01.localdomain.local:/gluster/brick3/export
start
gluster volume reset-brick export gl01.localdomain.local:/gluster/brick3/export
ovirt01.localdomain.local:/gluster/brick3/export commit force
[2017-07-06 13:04:30.221872] D [MSGID: 0] [glusterd-peer-utils.c:674:gd_peerinfo_find_from_hostname]
0-management: Friend ovirt01.localdomain.local found.. state: 3
[2017-07-06 13:04:30.221882] D [MSGID: 0] [glusterd-peer-utils.c:167:glusterd_hostname_to_uuid]
0-management: returning 0
[2017-07-06 13:04:30.221888] D [MSGID: 0] [glusterd-utils.c:1039:glusterd_resolve_brick]
0-management: Returning 0
[2017-07-06 13:04:30.221908] D [MSGID: 0] [glusterd-utils.c:998:glusterd_brickinfo_new]
0-management: Returning 0
[2017-07-06 13:04:30.221915] D [MSGID: 0] [glusterd-utils.c:1195:glusterd_brickinfo_new_from_brick] 0-management: Returning 0
[2017-07-06 13:04:30.222187] D [MSGID: 0] [glusterd-peer-utils.c:167:glusterd_hostname_to_uuid]
0-management: returning 0
[2017-07-06 13:04:30.222201] D [MSGID: 0] [glusterd-utils.c:1486:glusterd_volume_brickinfo_get]
0-management: Returning -1
[2017-07-06 13:04:30.222207] D [MSGID: 0] [store.c:459:gf_store_handle_destroy]
0-: Returning 0
[2017-07-06 13:04:30.222242] D [MSGID: 0] [glusterd-utils.c:1512:glusterd_volume_brickinfo_get_by_brick] 0-glusterd: Returning -1
glusterd_op_perform_replace_brick] 0-glusterd: Returning -1
[2017-07-06 13:04:30.222257] C [MSGID: 106074] [glusterd-reset-brick.c:372:glusterd_op_reset_brick]
0-management: Unable to add dst-brick: ovirt01.localdomain.local:/gluster/brick3/export
to volume: export
Does it shed more light?
Thanks,
Gianluca
Atin Mukherjee
2017-07-10 08:41:40 UTC
Post by Gianluca Cecchi
On Thu, Jul 6, 2017 at 5:26 PM, Gianluca Cecchi <
On Thu, Jul 6, 2017 at 8:38 AM, Gianluca Cecchi <
Eventually I can destroy and recreate this "export" volume again with
the old names (ovirt0N.localdomain.local) if you give me the sequence of
commands, then enable debug and retry the reset-brick command
Gianluca
So it seems I was able to destroy and re-create.
Now I see that the volume creation uses by default the new ip, so I
reverted the hostnames roles in the commands after putting glusterd in
debug mode on the host where I execute the reset-brick command (do I have
to set debug for the other nodes too?)
You have to set the log level to debug for glusterd instance where the
commit fails and share the glusterd log of that particular node.
Ok, done.
Command executed on ovirt01 with timestamp "2017-07-06 13:04:12" in
glusterd log files
gl01.localdomain.local:/gluster/brick3/export start
volume reset-brick: success: reset-brick start operation successful
gl01.localdomain.local:/gluster/brick3/export
ovirt01.localdomain.local:/gluster/brick3/export commit force
volume reset-brick: failed: Commit failed on ovirt02.localdomain.local.
Please check log file for details.
Commit failed on ovirt03.localdomain.local. Please check log file for
details.
ovirt01: https://drive.google.com/file/d/0BwoPbcrMv8mvY1RTTGp3RUhScm8/view?usp=sharing
ovirt02: https://drive.google.com/file/d/0BwoPbcrMv8mvSVpJUHNhMzhMSU0/view?usp=sharing
ovirt03: https://drive.google.com/file/d/0BwoPbcrMv8mvT2xiWEdQVmJNb0U/view?usp=sharing
HIH debugging
Gianluca
Hi Atin,
did you have time to see the logs?
Comparing debug enabled messages with previous ones, I see these added
lines on nodes where commit failed after running the commands
gluster volume reset-brick export gl01.localdomain.local:/gluster/brick3/export
start
gluster volume reset-brick export gl01.localdomain.local:/gluster/brick3/export
ovirt01.localdomain.local:/gluster/brick3/export commit force
[2017-07-06 13:04:30.221872] D [MSGID: 0] [glusterd-peer-utils.c:674:gd_peerinfo_find_from_hostname]
0-management: Friend ovirt01.localdomain.local found.. state: 3
[2017-07-06 13:04:30.221882] D [MSGID: 0] [glusterd-peer-utils.c:167:glusterd_hostname_to_uuid]
0-management: returning 0
[2017-07-06 13:04:30.221888] D [MSGID: 0] [glusterd-utils.c:1039:glusterd_resolve_brick]
0-management: Returning 0
[2017-07-06 13:04:30.221908] D [MSGID: 0] [glusterd-utils.c:998:glusterd_brickinfo_new]
0-management: Returning 0
[2017-07-06 13:04:30.221915] D [MSGID: 0] [glusterd-utils.c:1195:glusterd_brickinfo_new_from_brick]
0-management: Returning 0
[2017-07-06 13:04:30.222187] D [MSGID: 0] [glusterd-peer-utils.c:167:glusterd_hostname_to_uuid]
0-management: returning 0
[2017-07-06 13:04:30.222201] D [MSGID: 0] [glusterd-utils.c:1486:glusterd_volume_brickinfo_get]
0-management: Returning -1
The above log entry is the reason for the failure. GlusterD is unable to find
the old brick (src_brick) in its volinfo structure. FWIW, would you be able to
share the 'gluster get-state' output & the gluster volume info output after
running reset-brick start? I'd need to check why glusterd is unable to find
the old brick's details from its volinfo post reset-brick start.
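I.e., on each of the three nodes, right after the reset-brick start step,
something like this (a sketch; the output file name is only an example):

# dump the local glusterd state; by default it is written under /var/run/gluster/
gluster get-state

# capture the volume definition as this node currently sees it
gluster volume info export > /tmp/export-volinfo-$(hostname -s).txt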
[2017-07-06 13:04:30.222207] D [MSGID: 0] [store.c:459:gf_store_handle_destroy]
0-: Returning 0
[2017-07-06 13:04:30.222242] D [MSGID: 0] [glusterd-utils.c:1512:glusterd_volume_brickinfo_get_by_brick] 0-glusterd: Returning -1
glusterd_op_perform_replace_brick] 0-glusterd: Returning -1
[2017-07-06 13:04:30.222257] C [MSGID: 106074]
[glusterd-reset-brick.c:372:glusterd_op_reset_brick] 0-management: Unable
to add dst-brick: ovirt01.localdomain.local:/gluster/brick3/export to
volume: export
Does it shed more light?
Thanks,
Gianluca