Discussion:
[Gluster-users] Rebalance taking > 2 months
Rusty Bower
2018-07-16 00:43:18 UTC
Hey folks,

I just added a new brick to my existing gluster volume, but *gluster volume
rebalance data status* is telling me the following: Estimated time left for
rebalance to complete : > 2 months. Please try again later.

I already did a fix-layout, but this thing is absolutely crawling trying
to rebalance everything (the last estimate was ~40 years).

Any thoughts on whether this is a bug, or on ways to speed this up? It's
taking ~6 hours to scan 6000 files, which seems unreasonably slow.

Thanks
Rusty
Nithya Balachandran
2018-07-16 04:44:37 UTC
Hi Rusty,

We need the following information:

1. The exact gluster version you are running
2. gluster volume info <volname>
3. gluster volume rebalance <volname> status
4. Information on the directory structure and file locations on your
volume.
5. How many levels of directories
6. How many files and directories in each level
7. How many directories and files in total (a rough estimate)
8. Average file size

Please note that having a rebalance running in the background should not
affect your volume access in any way. However, I would like to know why only
6000 files have been scanned in 6 hours.
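
Items 5 to 8 in the list above could be collected with a short script rather than by hand. What follows is a minimal Python sketch and not something posted in this thread: `summarize_tree` is a hypothetical helper, and it assumes it is pointed at a client mount of the volume rather than at a brick directory (a brick also contains Gluster's internal `.glusterfs` tree, which would inflate the counts). Walking a large tree this way can itself take a long time.

```python
import os
from collections import Counter


def summarize_tree(root):
    """Count files and directories per depth below root, plus average file size."""
    root = os.path.abspath(root)
    base = root.rstrip(os.sep).count(os.sep)
    files_at = Counter()  # depth -> files directly inside directories at that depth
    dirs_at = Counter()   # depth -> subdirectories directly inside those directories
    total_bytes = 0
    max_depth = 0

    for dirpath, dirnames, filenames in os.walk(root):
        # Depth of the directory being visited, with root itself at depth 0.
        depth = dirpath.rstrip(os.sep).count(os.sep) - base
        max_depth = max(max_depth, depth)
        dirs_at[depth] += len(dirnames)
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # entry vanished or is unreadable; leave it out
            files_at[depth] += 1
            total_bytes += size

    total_files = sum(files_at.values())
    return {
        "max_depth": max_depth,
        "files_per_level": dict(sorted(files_at.items())),
        "dirs_per_level": dict(sorted(dirs_at.items())),
        "total_files": total_files,
        "total_dirs": sum(dirs_at.values()),
        "average_file_size": total_bytes / total_files if total_files else 0.0,
    }
```

Calling `summarize_tree("/mnt/glustervol")` (the mount path here is only a placeholder) returns a dictionary with the number of levels, the per-level counts, the totals, and the average file size in bytes.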

Regards,
Nithya
_______________________________________________
Gluster-users mailing list
https://lists.gluster.org/mailman/listinfo/gluster-users
Nithya Balachandran
2018-07-16 12:37:08 UTC
If possible, please send the rebalance logs as well.
Rusty Bower
2018-07-16 16:36:04 UTC
Thanks for the reply Nithya.

1. glusterfs 4.1.1

2. Volume Name: data
Type: Distribute
Volume ID: 294d95ce-0ff3-4df9-bd8c-a52fc50442ba
Status: Started
Snapshot Count: 0
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: datanode01:/mnt/data/bricks/data
Brick2: datanode02:/mnt/data/bricks/data
Brick3: datanode03:/mnt/data/bricks/data
Options Reconfigured:
performance.readdir-ahead: on

3.
Node        Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
----------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost   36822             11.3GB   50715    0         0        in progress  26:46:17
datanode02  0                 0Bytes   2852     0         0        in progress  26:46:16
datanode03  3128              513.7MB  11442    0         3128     in progress  26:46:17
Estimated time left for rebalance to complete : > 2 months. Please try again later.
volume rebalance: data: success

4. Directory structure is basically an rsync backup of some old systems as
well as all of my personal media. I can elaborate more, but it's a pretty
standard filesystem.

5. In some folders there might be up to like 12-15 levels of directories
(especially the backups)

6. I'm honestly not sure, I can try to scrounge this number up

7. My guess would be > 100k

8. Most files are pretty large (media files), but there's a lot of small
files (metadata and configuration files) as well
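
The "scanned" and "run time" columns in item 3 are enough to work out a per-node scan rate. The snippet below is a rough Python sketch using only the figures from that status output; the helper names `hms_to_seconds` and `scanned_per_hour` are made up for illustration.

```python
def hms_to_seconds(hms):
    """Convert an h:m:s string such as '26:46:17' into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (files scanned, run time) per node, copied from the status output above.
STATUS = {
    "localhost": (50715, "26:46:17"),
    "datanode02": (2852, "26:46:16"),
    "datanode03": (11442, "26:46:17"),
}


def scanned_per_hour(status):
    """Return files scanned per hour for each node."""
    return {
        node: scanned * 3600 / hms_to_seconds(runtime)
        for node, (scanned, runtime) in status.items()
    }


rates = scanned_per_hour(STATUS)
for node, rate in rates.items():
    print(f"{node}: about {rate:.0f} files scanned per hour")
```

On these numbers the busiest node is scanning on the order of a couple of thousand files per hour, and datanode02 far fewer.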

I've also appended a (moderately sanitized) snippet of the rebalance log
(let me know if you need more)

[2018-07-16 17:37:59.979003] I [MSGID: 0]
[dht-rebalance.c:1799:dht_migrate_file] 0-data-dht: destination for file -
/this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml is changed to
- data-client-2
[2018-07-16 17:38:00.004262] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of
/this/is/a/file/path/that/exists/wz/wz/Npc.wz/2112002.img.xml from
subvolume data-client-0 to data-client-2
[2018-07-16 17:38:00.725582] I
[dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs:
TIME: (size) total_processed=43108305980 tmp_cnt =
55419279917056,rate_processed=446597.869797, elapsed = 96526.000000
[2018-07-16 17:38:00.725641] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124092127
seconds, seconds left = 123995601
[2018-07-16 17:38:00.725709] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in
progress. Time taken is 96526.00 secs
[2018-07-16 17:38:00.725738] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated:
36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:02.769121] I
[dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs:
TIME: (size) total_processed=43108305980 tmp_cnt =
55419279917056,rate_processed=446588.616567, elapsed = 96528.000000
[2018-07-16 17:38:02.769207] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124094698
seconds, seconds left = 123998170
[2018-07-16 17:38:02.769263] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in
progress. Time taken is 96528.00 secs
[2018-07-16 17:38:02.769286] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated:
36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:03.410469] I [dht-rebalance.c:1645:dht_migrate_file]
0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml:
attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:03.416127] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of
/this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml from
subvolume data-client-0 to data-client-2
[2018-07-16 17:38:04.738885] I [dht-rebalance.c:1645:dht_migrate_file]
0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml:
attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:04.745722] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of
/this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml from
subvolume data-client-0 to data-client-2
[2018-07-16 17:38:04.812368] I
[dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs:
TIME: (size) total_processed=43108308134 tmp_cnt =
55419279917056,rate_processed=446579.386035, elapsed = 96530.000000
[2018-07-16 17:38:04.812417] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124097263
seconds, seconds left = 124000733
[2018-07-16 17:38:04.812465] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in
progress. Time taken is 96530.00 secs
[2018-07-16 17:38:04.812489] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated:
36877, size: 12270261443, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:04.992413] I [dht-rebalance.c:1645:dht_migrate_file]
0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml:
attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:04.994122] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of
/this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml from
subvolume data-client-0 to data-client-2
[2018-07-16 17:38:06.855618] I
[dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs:
TIME: (size) total_processed=43108318798 tmp_cnt =
55419279917056,rate_processed=446570.244043, elapsed = 96532.000000
[2018-07-16 17:38:06.855719] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124099804
seconds, seconds left = 124003272
[2018-07-16 17:38:06.855770] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in
progress. Time taken is 96532.00 secs
[2018-07-16 17:38:06.855793] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated:
36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:08.511064] I [dht-rebalance.c:1645:dht_migrate_file]
0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201055.img.xml:
attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:08.533029] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of
/this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml from
subvolume data-client-0 to data-client-2
[2018-07-16 17:38:08.899708] I
[dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs:
TIME: (size) total_processed=43108318798 tmp_cnt =
55419279917056,rate_processed=446560.991961, elapsed = 96534.000000
[2018-07-16 17:38:08.899791] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124102375
seconds, seconds left = 124005841
[2018-07-16 17:38:08.899842] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in
progress. Time taken is 96534.00 secs
[2018-07-16 17:38:08.899865] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated:
36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
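
The TIME lines in the log carry enough numbers to reproduce the estimate by hand. The Python sketch below is an illustration and not Gluster's own code: it reads `tmp_cnt` as the total number of bytes the estimator expects to process, which is an assumption inferred from how the logged values relate to each other.

```python
def estimate(total_processed, total_size, elapsed):
    """Return (rate in bytes/s, estimated total seconds, seconds left).

    total_size is assumed to correspond to tmp_cnt in the log.
    """
    rate = total_processed / elapsed
    total_time = total_size / rate
    return rate, total_time, total_time - elapsed


# Values from the log entry at 17:38:00.725582.
rate, total_time, left = estimate(
    total_processed=43108305980,
    total_size=55419279917056,
    elapsed=96526.0,
)
years_left = left / (365 * 24 * 3600)

print(f"rate: {rate:.6f} bytes/s")
print(f"estimated total: {int(total_time)} s, left: {int(left)} s")
print(f"roughly {years_left:.1f} years left")
```

With these inputs the rate, total, and remaining seconds line up with the logged `rate_processed=446597.869797`, `124092127` and `123995601`, which works out to a little under four years at roughly 447 KB/s.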
Nithya Balachandran
2018-07-23 08:12:15 UTC
Hi Rusty,

Sorry I took so long to get back to you.

Which is the newly added brick? I see datanode02 has not picked up any
files for migration, which is odd.
How full are the individual bricks (df -h output)?
Is each of your bricks in a separate partition?
Can you send me the rebalance logs from all 3 nodes (offline if you prefer)?

We can try using scripts to speed up the rebalance if you prefer.

Regards,
Nithya
Rusty Bower
2018-07-23 08:19:50 UTC
datanode03 is the newest brick

the bricks had gotten pretty full, which I think might be part of the issue
(df -h columns: filesystem, size, used, avail, use%, mounted on):
- datanode01 /dev/sda1 51T 48T 3.3T 94% /mnt/data
- datanode02 /dev/sda1 51T 48T 3.4T 94% /mnt/data
- datanode03 /dev/md0 128T 4.6T 123T 4% /mnt/data

each of the bricks is on a completely separate disk from the OS
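
To get a feel for how much data is involved, the df figures above can be turned into a rough target. The Python sketch below assumes the goal is the same fill percentage on every brick; that is only an approximation of what a rebalance aims for, and `even_fill_plan` is a hypothetical helper, not a Gluster tool.

```python
# (size in TB, used in TB) per brick, taken from the df output above.
BRICKS = {
    "datanode01": (51.0, 48.0),
    "datanode02": (51.0, 48.0),
    "datanode03": (128.0, 4.6),
}


def even_fill_plan(bricks):
    """Return (target fill fraction, TB to move off each brick).

    A positive value means data should leave the brick; a negative
    value means the brick should receive that much.
    """
    total_size = sum(size for size, _ in bricks.values())
    total_used = sum(used for _, used in bricks.values())
    fill = total_used / total_size
    moves = {
        name: used - size * fill
        for name, (size, used) in bricks.items()
    }
    return fill, moves


fill, plan = even_fill_plan(BRICKS)
print(f"target fill: {fill:.1%}")
for name, delta in plan.items():
    direction = "move off" if delta > 0 else "receive"
    print(f"{name}: {direction} about {abs(delta):.1f} TB")
```

Under that assumption each of the two older bricks would need to shed about 25.7 TB, with datanode03 taking in about 51.4 TB.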

I'll shoot you the log files offline :)

Thanks!
Rusty
Rusty Bower
2018-07-28 01:41:29 UTC
Just wanted to ping this to see if you guys had any thoughts, or other
scripts I can run for this stuff. It's still predicting another 90 days to
rebalance this, and performance is basically garbage while it rebalances.

Rusty
Post by Rusty Bower
datanode03 is the newest brick
- datanode01 /dev/sda1 51T 48T 3.3T 94% /mnt/data
- datanode02 /dev/sda1 51T 48T 3.4T 94% /mnt/data
- datanode03 /dev/md0 128T 4.6T 123T 4% /mnt/data
each of the bricks are on a completely separate disk from the OS
I'll shoot you the log files offline :)
Thanks!
Rusty
Post by Nithya Balachandran
Hi Rusty,
Sorry I took so long to get back to you.
Which is the newly added brick? I see datanode02 has not picked up any
files for migration which is odd.
How full are the individual bricks (df -h ) output.
Is each of your bricks in a separate partition?
Can you send me the rebalance logs from all 3 nodes (offline if you prefer)?
We can try using scripts to speed up the rebalance if you prefer.
Regards,
Nithya
Post by Rusty Bower
Thanks for the reply Nithya.
1. glusterfs 4.1.1
2. Volume Name: data
Type: Distribute
Volume ID: 294d95ce-0ff3-4df9-bd8c-a52fc50442ba
Status: Started
Snapshot Count: 0
Number of Bricks: 3
Transport-type: tcp
Brick1: datanode01:/mnt/data/bricks/data
Brick2: datanode02:/mnt/data/bricks/data
Brick3: datanode03:/mnt/data/bricks/data
performance.readdir-ahead: on
3.
Node Rebalanced-files size
scanned failures skipped status run time in
h:m:s
--------- ----------- -----------
----------- ----------- ----------- ------------
--------------
localhost 36822 11.3GB
50715 0 0 in progress 26:46:17
datanode02 0 0Bytes
2852 0 0 in progress 26:46:16
datanode03 3128 513.7MB
11442 0 3128 in progress 26:46:17
Estimated time left for rebalance to complete : > 2 months. Please try
again later.
volume rebalance: data: success
4. Directory structure is basically an rsync backup of some old systems
as well as all of my personal media. I can elaborate more, but it's a
pretty standard filesystem.
5. In some folders there might be up to like 12-15 levels of directories
(especially the backups)
6. I'm honestly not sure, I can try to scrounge this number up
7. My guess would be > 100k
8. Most files are pretty large (media files), but there's a lot of small
files (metadata and configuration files) as well
I've also appended a (moderately sanitized) snippet of the rebalance
log (let me know if you need more)
[2018-07-16 17:37:59.979003] I [MSGID: 0] [dht-rebalance.c:1799:dht_migrate_file] 0-data-dht: destination for file - /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml is changed to - data-client-2
[2018-07-16 17:38:00.004262] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2112002.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:00.725582] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108305980 tmp_cnt = 55419279917056,rate_processed=446597.869797, elapsed = 96526.000000
[2018-07-16 17:38:00.725641] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124092127 seconds, seconds left = 123995601
[2018-07-16 17:38:00.725709] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96526.00 secs
[2018-07-16 17:38:00.725738] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:02.769121] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108305980 tmp_cnt = 55419279917056,rate_processed=446588.616567, elapsed = 96528.000000
[2018-07-16 17:38:02.769207] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124094698 seconds, seconds left = 123998170
[2018-07-16 17:38:02.769263] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96528.00 secs
[2018-07-16 17:38:02.769286] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:03.410469] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:03.416127] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:04.738885] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:04.745722] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:04.812368] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108308134 tmp_cnt = 55419279917056,rate_processed=446579.386035, elapsed = 96530.000000
[2018-07-16 17:38:04.812417] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124097263 seconds, seconds left = 124000733
[2018-07-16 17:38:04.812465] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96530.00 secs
[2018-07-16 17:38:04.812489] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36877, size: 12270261443, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:04.992413] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:04.994122] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:06.855618] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108318798 tmp_cnt = 55419279917056,rate_processed=446570.244043, elapsed = 96532.000000
[2018-07-16 17:38:06.855719] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124099804 seconds, seconds left = 124003272
[2018-07-16 17:38:06.855770] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96532.00 secs
[2018-07-16 17:38:06.855793] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:08.511064] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201055.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:08.533029] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:08.899708] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108318798 tmp_cnt = 55419279917056,rate_processed=446560.991961, elapsed = 96534.000000
[2018-07-16 17:38:08.899791] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124102375 seconds, seconds left = 124005841
[2018-07-16 17:38:08.899842] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96534.00 secs
[2018-07-16 17:38:08.899865] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
Post by Nithya Balachandran
If possible, please send the rebalance logs as well.
Post by Nithya Balachandran
Hi Rusty,
1. The exact gluster version you are running
2. gluster volume info <volname>
3. gluster rebalance status
4. Information on the directory structure and file locations on
your volume.
5. How many levels of directories
6. How many files and directories in each level
7. How many directories and files in total (a rough estimate)
8. Average file size
Please note that having a rebalance running in the background should
not affect your volume access in any way. However I would like to know why
only 6000 files have been scanned in 6 hours.
Regards,
Nithya
Post by Rusty Bower
Hey folks,
I just added a new brick to my existing gluster volume, but *gluster
volume rebalance data status* is telling me the following: Estimated
time left for rebalance to complete : > 2 months. Please try again later.
I already did a fix-mapping, but this thing is absolutely crawling
trying to rebalance everything (last estimate was ~40 years)
Any thoughts on if this is a bug, or ways to speed this up? It's
taking ~6 hours to scan 6000 files, which seems unreasonably slow.
Thanks
Rusty
Nithya Balachandran
2018-07-30 08:40:35 UTC
Permalink
Hi Rusty,

Sorry for the delay getting back to you. I had a quick look at the
rebalance logs - it looks like the estimates are based on the time taken to
rebalance the smaller files.
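
As a rough check of where that estimate comes from, the figures in the "TIME: (size)" entries of the rebalance log quoted in this thread can be reproduced with a few lines of Python. This is only a sketch: the formulas are inferred from the logged numbers, not taken from the gluster source, the reading of tmp_cnt as "total bytes the estimator expects to process" is an assumption, and the function names are made up for illustration.

```python
from typing import Tuple

# Values copied from the 17:38:00 log entries quoted in this thread.
TOTAL_PROCESSED = 43108305980   # bytes processed so far (total_processed)
TMP_CNT = 55419279917056        # assumed: total bytes the estimator expects to process
ELAPSED = 96526.0               # seconds since the rebalance started (elapsed)
FILES_MIGRATED = 36876          # "Files migrated" from the same status entry
BYTES_MIGRATED = 12270259289    # "size" from the same status entry


def estimate(total_processed: int, tmp_cnt: int, elapsed: float) -> Tuple[float, int, int]:
    """Return (rate in bytes/s, estimated total seconds, seconds left).

    Inferred relationship: rate = processed / elapsed,
    total time = tmp_cnt / rate, time left = total time - elapsed.
    """
    rate = total_processed / elapsed
    total_seconds = int(tmp_cnt / rate)   # the log appears to truncate to whole seconds
    seconds_left = total_seconds - int(elapsed)
    return rate, total_seconds, seconds_left


def average_file_size(bytes_migrated: int, files_migrated: int) -> float:
    """Average size in bytes of the files migrated so far."""
    return bytes_migrated / files_migrated


if __name__ == "__main__":
    rate, total_seconds, seconds_left = estimate(TOTAL_PROCESSED, TMP_CNT, ELAPSED)
    print(f"rate_processed  ~ {rate:.6f} bytes/s")
    print(f"estimated total = {total_seconds} s")
    print(f"seconds left    = {seconds_left} s (~{seconds_left / 86400:.0f} days)")
    avg = average_file_size(BYTES_MIGRATED, FILES_MIGRATED)
    print(f"average migrated file size ~ {avg / 1000:.0f} KB")
```

With the logged inputs this gives the same 124092127 seconds total and 123995601 seconds left that the log reports. The average migrated file so far works out to roughly 333 KB, so a rate measured on small files like these is being extrapolated over the whole volume.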

We do have a scripting option where we can use virtual xattrs to trigger
file migration from a mount point. That would speed things up.


Regards,
Nithya
Rusty Bower
2018-07-30 14:53:46 UTC
Permalink
That would be awesome. Where can I find these?

Rusty

Nithya Balachandran
2018-07-30 16:48:02 UTC
Permalink
I have not documented this yet - I will send you the steps tomorrow.

Regards,
Nithya
Post by Rusty Bower
That would be awesome. Where can I find these?
Rusty
Sent from my iPhone
Hi Rusty,
Sorry for the delay getting back to you. I had a quick look at the
rebalance logs - it looks like the estimates are based on the time taken to
rebalance the smaller files.
We do have a scripting option where we can use virtual xattrs to trigger
file migration from a mount point. That would speed things up.
Regards,
Nithya
Post by Rusty Bower
Just wanted to ping this to see if you guys had any thoughts, or other
scripts I can run for this stuff. It's still predicting another 90 days to
rebalance this, and performance is basically garbage while it rebalances.
Rusty
Post by Rusty Bower
datanode03 is the newest brick
- datanode01 /dev/sda1 51T 48T 3.3T 94% /mnt/data
- datanode02 /dev/sda1 51T 48T 3.4T 94% /mnt/data
- datanode03 /dev/md0 128T 4.6T 123T 4% /mnt/data
each of the bricks are on a completely separate disk from the OS
I'll shoot you the log files offline :)
Thanks!
Rusty
On Mon, Jul 23, 2018 at 3:12 AM, Nithya Balachandran <
Post by Nithya Balachandran
Hi Rusty,
Sorry I took so long to get back to you.
Which is the newly added brick? I see datanode02 has not picked up any
files for migration which is odd.
How full are the individual bricks (df -h ) output.
Is each of your bricks in a separate partition?
Can you send me the rebalance logs from all 3 nodes (offline if you prefer)?
We can try using scripts to speed up the rebalance if you prefer.
Regards,
Nithya
Post by Rusty Bower
Thanks for the reply Nithya.
1. glusterfs 4.1.1
2. Volume Name: data
Type: Distribute
Volume ID: 294d95ce-0ff3-4df9-bd8c-a52fc50442ba
Status: Started
Snapshot Count: 0
Number of Bricks: 3
Transport-type: tcp
Brick1: datanode01:/mnt/data/bricks/data
Brick2: datanode02:/mnt/data/bricks/data
Brick3: datanode03:/mnt/data/bricks/data
performance.readdir-ahead: on
3.
Node Rebalanced-files
size scanned failures skipped status run
time in h:m:s
--------- -----------
----------- ----------- ----------- -----------
------------ --------------
localhost 36822
11.3GB 50715 0 0 in progress
26:46:17
datanode02 0
0Bytes 2852 0 0 in progress
26:46:16
datanode03 3128
513.7MB 11442 0 3128 in progress
26:46:17
Estimated time left for rebalance to complete : > 2 months. Please try again later.
volume rebalance: data: success
4. Directory structure is basically an rsync backup of some old
systems as well as all of my personal media. I can elaborate more, but it's
a pretty standard filesystem.
5. In some folders there might be up to like 12-15 levels of
directories (especially the backups)
6. I'm honestly not sure, I can try to scrounge this number up
7. My guess would be > 100k
8. Most files are pretty large (media files), but there's a lot of
small files (metadata and configuration files) as well
I've also appended a (moderately sanitized) snippet of the rebalance
log (let me know if you need more)
[2018-07-16 17:37:59.979003] I [MSGID: 0]
[dht-rebalance.c:1799:dht_migrate_file] 0-data-dht: destination for
file - /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml
is changed to - data-client-2
[2018-07-16 17:38:00.004262] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed
migration of /this/is/a/file/path/that/exis
ts/wz/wz/Npc.wz/2112002.img.xml from subvolume data-client-0 to
data-client-2
[2018-07-16 17:38:00.725582] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size]
0-glusterfs: TIME: (size) total_processed=43108305980 tmp_cnt =
55419279917056,rate_processed=446597.869797, elapsed = 96526.000000
[2018-07-16 17:38:00.725641] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124092127
seconds, seconds left = 123995601
[2018-07-16 17:38:00.725709] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is
in progress. Time taken is 96526.00 secs
[2018-07-16 17:38:00.725738] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files
migrated: 36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:02.769121] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size]
0-glusterfs: TIME: (size) total_processed=43108305980 tmp_cnt =
55419279917056,rate_processed=446588.616567, elapsed = 96528.000000
[2018-07-16 17:38:02.769207] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124094698
seconds, seconds left = 123998170
[2018-07-16 17:38:02.769263] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is
in progress. Time taken is 96528.00 secs
[2018-07-16 17:38:02.769286] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files
migrated: 36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:03.410469] I [dht-rebalance.c:1645:dht_migrate_file]
0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml:
attempting to move from
data-client-0 to data-client-2
[2018-07-16 17:38:03.416127] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed
migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml
from subvolume data-client-0 to
data-client-2
[2018-07-16 17:38:04.738885] I [dht-rebalance.c:1645:dht_migrate_file]
0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml:
attempting to move from
data-client-0 to data-client-2
[2018-07-16 17:38:04.745722] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed
migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml
from subvolume data-client-0 to
data-client-2
[2018-07-16 17:38:04.812368] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size]
0-glusterfs: TIME: (size) total_processed=43108308134 tmp_cnt =
55419279917056,rate_processed=446579.386035, elapsed = 96530.000000
[2018-07-16 17:38:04.812417] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124097263
seconds, seconds left = 124000733
[2018-07-16 17:38:04.812465] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is
in progress. Time taken is 96530.00 secs
[2018-07-16 17:38:04.812489] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files
migrated: 36877, size: 12270261443, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:04.992413] I [dht-rebalance.c:1645:dht_migrate_file]
0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml:
attempting to move from
data-client-0 to data-client-2
[2018-07-16 17:38:04.994122] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed
migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml
from subvolume data-client-0 to
data-client-2
[2018-07-16 17:38:06.855618] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size]
0-glusterfs: TIME: (size) total_processed=43108318798 tmp_cnt =
55419279917056,rate_processed=446570.244043, elapsed = 96532.000000
[2018-07-16 17:38:06.855719] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124099804
seconds, seconds left = 124003272
[2018-07-16 17:38:06.855770] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is
in progress. Time taken is 96532.00 secs
[2018-07-16 17:38:06.855793] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files
migrated: 36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:08.511064] I [dht-rebalance.c:1645:dht_migrate_file]
0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201055.img.xml:
attempting to move from
data-client-0 to data-client-2
[2018-07-16 17:38:08.533029] I [MSGID: 109022]
[dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed
migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml
from subvolume data-client-0 to
data-client-2
[2018-07-16 17:38:08.899708] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size]
0-glusterfs: TIME: (size) total_processed=43108318798 tmp_cnt =
55419279917056,rate_processed=446560.991961, elapsed = 96534.000000
[2018-07-16 17:38:08.899791] I [dht-rebalance.c:5130:gf_defrag_status_get]
0-glusterfs: TIME: Estimated total time to complete (size)= 124102375
seconds, seconds left = 124005841
[2018-07-16 17:38:08.899842] I [MSGID: 109028]
[dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is
in progress. Time taken is 96534.00 secs
[2018-07-16 17:38:08.899865] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files
migrated: 36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
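The estimate in these log lines can be reproduced by hand. A minimal sketch, assuming (from the numbers alone, not from the gluster source) that rate_processed is total_processed divided by elapsed, in bytes per second, and that tmp_cnt is the total number of bytes this node expects to process. The function name rebalance_eta is made up for illustration.

```bash
#!/usr/bin/env bash
# Reproduce the size-based rebalance estimate from one log sample.
# Arguments: total_processed (bytes), tmp_cnt (bytes; assumed to be the
# total to process), elapsed (seconds).
rebalance_eta() {
    awk -v processed="$1" -v total="$2" -v elapsed="$3" 'BEGIN {
        rate = processed / elapsed       # bytes per second so far
        est_total = int(total / rate)    # estimated total run time in seconds
        left = est_total - elapsed       # estimated seconds remaining
        printf "rate=%.2f B/s total=%d s left=%d s (%.0f days)\n",
               rate, est_total, left, left / 86400
    }'
}

# Values taken from the 17:38:00 sample in the log above.
rebalance_eta 43108305980 55419279917056 96526
```

With the sample values this comes to roughly 124 million seconds remaining, a little under four years, at a rate of about 446 KB/s. That is why the status command can only report "> 2 months": the rate measured so far is tiny compared with the amount of data still to be processed.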
Nithya Balachandran
2018-07-31 08:28:33 UTC
Permalink
Hi Rusty,

A rebalance involves 2 steps:

1. Setting a new layout on a directory
2. Migrating any files inside that directory that hash to a different
subvol based on the new layout set in step 1.
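The two steps can be pictured as a per-directory helper driven from a mount point. This is only a rough sketch and not the script referred to in this thread. The xattr names distribute.fix.layout and trusted.distribute.migrate-data, the values "yes" and "force", and the function name process_one_dir are all assumptions that should be checked against your gluster version. By default the helper only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply the two rebalance steps to a single directory.
# ASSUMPTION: these virtual xattr names and values are not given in the thread.
FIX_LAYOUT_XATTR="distribute.fix.layout"
MIGRATE_XATTR="trusted.distribute.migrate-data"

# Dry run by default: the command is echoed instead of executed.
# Set SETFATTR=setfattr in the environment to really apply the xattrs.
SETFATTR="${SETFATTR:-echo setfattr}"

process_one_dir() {
    dir="$1"
    # Step 1: request a new layout on the directory itself.
    $SETFATTR -n "$FIX_LAYOUT_XATTR" -v "yes" "$dir"
    # Step 2: request migration for each regular file directly inside it
    # (subdirectories are handled by separate calls).
    find "$dir" -maxdepth 1 -type f \
        -exec $SETFATTR -n "$MIGRATE_XATTR" -v "force" {} \;
}
```

Running process_one_dir /mnt/rebal/some/dir in dry-run mode lists one setfattr line for the directory and one per file, which makes it easy to review what would be touched before anything is changed.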


A few things to keep in mind:

- Any new content created on this volume will currently go to the newly
added brick.
- Having a more equitable file distribution is beneficial, but you might
not need a complete rebalance to achieve it. You can run the script on
just enough directories to free up space on your older bricks. Run it on
directories that contain large files to speed this up.

Do the following on one of the server nodes:

- Create a tmp mount point and mount the volume using the rebalance
volfile
  - mkdir /mnt/rebal
  - glusterfs -s localhost --volfile-id rebalance/data /mnt/rebal
- Select a directory in the volume which contains a lot of large files
and which has not been processed by the rebalance yet - the lower down in
the tree the better. Check the rebalance logs to figure out which dirs have
not been processed yet.
- cd /mnt/rebal/<chosen_dir>
- for dir in `find . -type d`; do echo $dir |xargs -0 -n1 -P10 bash process_dir.sh;done
- You can run this for different values of <chosen_dir> and on multiple
server nodes in parallel as long as the directory trees for the different
<chosen_dirs> don't overlap.
- Do this for multiple directories until the disk space used reduces on
the older bricks.
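A note on the loop in the steps above: a backtick for-loop splits directory names on whitespace, and because each xargs call receives a single name, -P10 has nothing to run in parallel. A hedged alternative is to let find emit NUL-terminated names straight into xargs. The function name run_on_tree and its arguments are placeholders, and the script path has to be absolute because the function changes directory first.

```bash
#!/usr/bin/env bash
# Run a per-directory script on every directory under a root,
# with up to 10 invocations in parallel.
#   $1 = absolute path to the script (for example process_dir.sh)
#   $2 = root of the directory tree to process
run_on_tree() {
    script="$1"
    root="$2"
    # -print0 / -0 keep names with spaces or newlines intact;
    # -n1 passes one directory per invocation; -P10 runs ten at a time.
    ( cd "$root" && find . -type d -print0 | xargs -0 -n1 -P10 bash "$script" )
}
```

For example, run_on_tree /root/process_dir.sh /mnt/rebal/<chosen_dir> would cover the same tree as the loop. Whether ten parallel workers is a sensible number depends on how much load the bricks and network can take.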

This is a very simple script. Let me know how it works - we can always
tweak it for your particular data set.
Post by Rusty Bower
and performance is basically garbage while it rebalances
Can you provide more detail on this? What kind of effects are you seeing?
How many clients access this volume?


Regards,
Nithya
Post by Rusty Bower
I have not documented this yet - I will send you the steps tomorrow.
Regards,
Nithya
Post by Rusty Bower
That would be awesome. Where can I find these?
Rusty
Hi Rusty,
Sorry for the delay getting back to you. I had a quick look at the
rebalance logs - it looks like the estimates are based on the time taken to
rebalance the smaller files.
We do have a scripting option where we can use virtual xattrs to trigger
file migration from a mount point. That would speed things up.
Regards,
Nithya
Post by Rusty Bower
Just wanted to ping this to see if you guys had any thoughts, or other
scripts I can run for this stuff. It's still predicting another 90 days to
rebalance this, and performance is basically garbage while it rebalances.
Rusty
Post by Rusty Bower
datanode03 is the newest brick.
df -h output (filesystem, size, used, avail, use%, mount point):
- datanode01 /dev/sda1 51T 48T 3.3T 94% /mnt/data
- datanode02 /dev/sda1 51T 48T 3.4T 94% /mnt/data
- datanode03 /dev/md0 128T 4.6T 123T 4% /mnt/data
Each of the bricks is on a completely separate disk from the OS.
I'll shoot you the log files offline :)
Thanks!
Rusty
On Mon, Jul 23, 2018 at 3:12 AM, Nithya Balachandran <
Post by Nithya Balachandran
Hi Rusty,
Sorry I took so long to get back to you.
Which is the newly added brick? I see datanode02 has not picked up
any files for migration which is odd.
How full are the individual bricks (df -h output)?
Is each of your bricks in a separate partition?
Can you send me the rebalance logs from all 3 nodes (offline if you prefer)?
We can try using scripts to speed up the rebalance if you prefer.
Regards,
Nithya
Post by Rusty Bower
Thanks for the reply Nithya.
1. glusterfs 4.1.1
2. Volume Name: data
Type: Distribute
Volume ID: 294d95ce-0ff3-4df9-bd8c-a52fc50442ba
Status: Started
Snapshot Count: 0
Number of Bricks: 3
Transport-type: tcp
Brick1: datanode01:/mnt/data/bricks/data
Brick2: datanode02:/mnt/data/bricks/data
Brick3: datanode03:/mnt/data/bricks/data
performance.readdir-ahead: on
3.
Node        Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
----------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost   36822             11.3GB   50715    0         0        in progress  26:46:17
datanode02  0                 0Bytes   2852     0         0        in progress  26:46:16
datanode03  3128              513.7MB  11442    0         3128     in progress  26:46:17
Estimated time left for rebalance to complete : > 2 months. Please
try again later.
volume rebalance: data: success
4. Directory structure is basically an rsync backup of some old
systems as well as all of my personal media. I can elaborate more, but it's
a pretty standard filesystem.
5. In some folders there might be up to like 12-15 levels of
directories (especially the backups)
6. I'm honestly not sure, I can try to scrounge this number up
7. My guess would be > 100k
8. Most files are pretty large (media files), but there's a lot of
small files (metadata and configuration files) as well
Rusty Bower
2018-07-31 14:14:51 UTC
Permalink
I'll figure out what hasn't been rebalanced yet and run the script.

There's only a single client accessing this gluster volume, and while the
rebalance is taking place, I am only able to read/write to the volume
at around 3MB/s. If I log onto one of the bricks, I can read/write to the
physical volumes at speeds greater than 100MB/s (which is what I would
expect).
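For a repeatable comparison between the client mount and a brick, something like the following could be used. It is a rough sketch, assuming GNU dd on Linux. The function name measure_throughput and the hidden test file name are made up, and the directory argument stands for either the client mount point or a scratch directory on a brick's disk.

```bash
#!/usr/bin/env bash
# Write a test file of N MiB into a directory, read it back, and print
# dd's summary line for each direction. The test file is removed afterwards.
#   $1 = directory to test, $2 = size in MiB (default 1024)
measure_throughput() {
    dir="$1"
    size_mib="${2:-1024}"
    testfile="$dir/.throughput_test.$$"
    # conv=fsync makes dd flush to storage before reporting, so the
    # page cache does not inflate the write figure.
    echo "write: $(dd if=/dev/zero of="$testfile" bs=1M count="$size_mib" conv=fsync 2>&1 | tail -n 1)"
    # The read may be served from cache unless the file is larger than RAM.
    echo "read:  $(dd if="$testfile" of=/dev/null bs=1M 2>&1 | tail -n 1)"
    rm -f "$testfile"
}
```

Running it once against the client mount and once against a brick's local filesystem, both with and without the rebalance active, gives four sets of numbers that can be compared directly.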

Thanks!
Rusty
Nithya Balachandran
2018-07-31 16:40:44 UTC
Permalink
Post by Rusty Bower
I'll figure out what hasn't been rebalanced yet and run the script.
There's only a single client accessing this gluster volume, and while the
rebalance is taking place, I am only able to read/write to the volume
at around 3MB/s. If I log onto one of the bricks, I can read/write to the
physical volumes at speeds greater than 100MB/s (which is what I would
expect).
What are the numbers when accessing the volume when rebalance is not
running?
Regards,
Nithya
Post by Rusty Bower
Thanks!
Rusty
Post by Nithya Balachandran
Hi Rusty,
1. Setting a new layout on a directory
2. Migrating any files inside that directory that hash to a different
subvol based on the new layout set in step 1.
- Any new content created on this volume will currently go to the
newly added brick.
- Having a more equitable file distribution is beneficial but you
might not need to do a complete rebalance to do this. You can run the
script on just enough directories to free up space on your older bricks.
This should be done on bricks which contains large files to speed this up.
- Create a tmp mount point and mount the volume using the rebalance
volfile
- mkdir /mnt/rebal
- glusterfs -s localhost --volfile-id rebalance/data /mnt/rebal
- Select a directory in the volume which contains a lot of large
files and which has not been processed by the rebalance yet - the lower
down in the tree the better. Check the rebalance logs to figure out which
dirs have not been processed yet.
- cd /mnt/rebal/<chosen_dir>
- for dir in `find . -type d`; do echo $dir |xargs -0 -n1 -P10
bash process_dir.sh;done
- You can run this for different values of <chosen_dir> and on
multiple server nodes in parallel as long as the directory trees for the
different <chosen_dirs> don't overlap.
- Do this for multiple directories until the disk space used reduces
on the older bricks.
This is a very simple script. Let me know how it works - we can always
tweak it for your particular data set.
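To pick candidate directories for the "lot of large files" step, something like the following could be run against the brick directory on one of the older nodes. This is only a sketch: `rank_dirs_by_large_files` is a made-up helper, the 1G default threshold is arbitrary, and it assumes GNU `find` (for `-printf`). It skips the brick's internal `.glusterfs` directory.

```shell
# Hypothetical helper (not part of the script mentioned above).
# Usage: rank_dirs_by_large_files ROOT [MIN_SIZE] [TOP_N]
# Prints "count directory" for the TOP_N directories under ROOT that
# directly contain the most files larger than MIN_SIZE (find -size syntax).
rank_dirs_by_large_files() {
    find "${1:-.}" -type d -name .glusterfs -prune -o \
        -type f -size "${2:-+1G}" -printf '%h\n' \
        | sort | uniq -c | sort -rn | head -n "${3:-20}"
}
```

With the brick path from the volume info, a call would look like `rank_dirs_by_large_files /mnt/data/bricks/data +1G 20`. The paths it prints are brick paths, so the matching directory under /mnt/rebal is the part after the brick root. Whether a directory has already been processed still has to be checked in the rebalance logs.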
Post by Rusty Bower
and performance is basically garbage while it rebalances
Can you provide more detail on this? What kind of effects are you seeing?
How many clients access this volume?
Regards,
Nithya
Post by Rusty Bower
I have not documented this yet - I will send you the steps tomorrow.
Regards,
Nithya
Post by Rusty Bower
That would be awesome. Where can I find these?
Rusty
Sent from my iPhone
Hi Rusty,
Sorry for the delay getting back to you. I had a quick look at the
rebalance logs - it looks like the estimates are based on the time taken to
rebalance the smaller files.
We do have a scripting option where we can use virtual xattrs to
trigger file migration from a mount point. That would speed things up.
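As a rough illustration of what "virtual xattrs from a mount point" can look like, the sketch below prints one `setfattr` command per regular file in a directory. The key `trusted.distribute.migrate-data` and the value `force` are assumptions based on my reading of the DHT translator source, not something stated in this thread, so please verify them against your Gluster version before running anything. `migrate_files_in_dir` is a hypothetical helper.

```shell
# ASSUMPTION: the xattr key and value are not confirmed by this thread;
# override them with MIGRATE_XATTR / MIGRATE_VALUE if your release differs.
# Prints the commands by default; set DRY_RUN=0 to really call setfattr.
# Handles only the files directly inside DIR (no recursion, no dotfiles).
migrate_files_in_dir() (
    dir=$1
    key=${MIGRATE_XATTR:-trusted.distribute.migrate-data}
    value=${MIGRATE_VALUE:-force}
    for file in "$dir"/*; do
        # Regular files only; skip subdirectories and symlinks.
        [ -f "$file" ] && [ ! -L "$file" ] || continue
        if [ "${DRY_RUN:-1}" = 1 ]; then
            printf "setfattr -n %s -v %s '%s'\n" "$key" "$value" "$file"
        else
            setfattr -n "$key" -v "$value" "$file"
        fi
    done
)
```

The dry-run default is deliberate: it lets you review exactly which files would be touched before any migration is requested.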
Regards,
Nithya
Post by Rusty Bower
Just wanted to ping this to see if you guys had any thoughts, or other
scripts I can run for this stuff. It's still predicting another 90 days to
rebalance this, and performance is basically garbage while it rebalances.
Rusty
Post by Rusty Bower
datanode03 is the newest brick.
Node        Filesystem  Size  Used  Avail  Use%  Mounted on
datanode01  /dev/sda1   51T   48T   3.3T   94%   /mnt/data
datanode02  /dev/sda1   51T   48T   3.4T   94%   /mnt/data
datanode03  /dev/md0    128T  4.6T  123T   4%    /mnt/data
Each of the bricks is on a completely separate disk from the OS.
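For a sense of scale, the df figures can be turned into a rough estimate of how much data would have to move for all three bricks to end up equally full. This is a back-of-the-envelope sketch only: `balance_plan` is a made-up helper, and it assumes the goal is an equal fill percentage, which may not be exactly what the rebalance layout aims for.

```shell
# Hypothetical helper. Reads one line per brick on stdin:
#   name size_T used_T
# Prints the common fill level, then for each brick the used space at that
# fill level and the delta needed to reach it (negative = data must leave).
balance_plan() {
    LC_ALL=C awk '
        { name[NR] = $1; size[NR] = $2; used[NR] = $3
          total_size += $2; total_used += $3 }
        END {
            if (total_size <= 0) exit 1
            fill = total_used / total_size
            printf "target fill: %.1f%%\n", fill * 100
            for (i = 1; i <= NR; i++)
                printf "%s target=%.1fT delta=%+.1fT\n",
                       name[i], size[i] * fill, size[i] * fill - used[i]
        }'
}
```

Fed the three lines "datanode01 51 48", "datanode02 51 48" and "datanode03 128 4.6", it reports a target fill of 43.7%, with about 25.7T leaving each of the two older bricks and about 51.4T arriving on datanode03.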
I'll shoot you the log files offline :)
Thanks!
Rusty
On Mon, Jul 23, 2018 at 3:12 AM, Nithya Balachandran <
Post by Nithya Balachandran
Hi Rusty,
Sorry I took so long to get back to you.
Which is the newly added brick? I see datanode02 has not picked up
any files for migration which is odd.
How full are the individual bricks (df -h output)?
Is each of your bricks in a separate partition?
Can you send me the rebalance logs from all 3 nodes (offline if you prefer)?
We can try using scripts to speed up the rebalance if you prefer.
Regards,
Nithya
Post by Rusty Bower
Thanks for the reply Nithya.
1. glusterfs 4.1.1
2. Volume Name: data
Type: Distribute
Volume ID: 294d95ce-0ff3-4df9-bd8c-a52fc50442ba
Status: Started
Snapshot Count: 0
Number of Bricks: 3
Transport-type: tcp
Brick1: datanode01:/mnt/data/bricks/data
Brick2: datanode02:/mnt/data/bricks/data
Brick3: datanode03:/mnt/data/bricks/data
performance.readdir-ahead: on
3.
Node        Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
----------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost   36822             11.3GB   50715    0         0        in progress  26:46:17
datanode02  0                 0Bytes   2852     0         0        in progress  26:46:16
datanode03  3128              513.7MB  11442    0         3128     in progress  26:46:17
Estimated time left for rebalance to complete : > 2 months. Please try again later.
volume rebalance: data: success
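The status output can be reduced to a per-node scan rate, which makes the slowness easier to compare between nodes. A small sketch follows; `hms_to_seconds` and `scan_rate` are hypothetical helper names.

```shell
# Convert an "h:m:s" run time, as printed by the status command, to seconds.
hms_to_seconds() {
    echo "$1" | LC_ALL=C awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# scan_rate SCANNED_FILES RUN_TIME_HMS
# Prints the average number of files scanned per second.
scan_rate() {
    LC_ALL=C awk -v n="$1" -v t="$(hms_to_seconds "$2")" 'BEGIN {
        if (t <= 0) exit 1
        printf "%.2f\n", n / t
    }'
}
```

With the numbers above, `scan_rate 50715 26:46:17` gives 0.53 files per second for localhost, and `scan_rate 2852 26:46:16` gives 0.03 for datanode02.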
4. Directory structure is basically an rsync backup of some old
systems as well as all of my personal media. I can elaborate more, but it's
a pretty standard filesystem.
5. In some folders there might be up to like 12-15 levels of
directories (especially the backups)
6. I'm honestly not sure, I can try to scrounge this number up
7. My guess would be > 100k
8. Most files are pretty large (media files), but there's a lot of
small files (metadata and configuration files) as well
I've also appended a (moderately sanitized) snippet of the rebalance
log (let me know if you need more)
[2018-07-16 17:37:59.979003] I [MSGID: 0] [dht-rebalance.c:1799:dht_migrate_file] 0-data-dht: destination for file - /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml is changed to - data-client-2
[2018-07-16 17:38:00.004262] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2112002.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:00.725582] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108305980 tmp_cnt = 55419279917056,rate_processed=446597.869797, elapsed = 96526.000000
[2018-07-16 17:38:00.725641] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124092127 seconds, seconds left = 123995601
[2018-07-16 17:38:00.725709] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96526.00 secs
[2018-07-16 17:38:00.725738] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:02.769121] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108305980 tmp_cnt = 55419279917056,rate_processed=446588.616567, elapsed = 96528.000000
[2018-07-16 17:38:02.769207] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124094698 seconds, seconds left = 123998170
[2018-07-16 17:38:02.769263] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96528.00 secs
[2018-07-16 17:38:02.769286] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36876, size: 12270259289, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:03.410469] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:03.416127] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2040036.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:04.738885] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:04.745722] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201002.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:04.812368] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108308134 tmp_cnt = 55419279917056,rate_processed=446579.386035, elapsed = 96530.000000
[2018-07-16 17:38:04.812417] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124097263 seconds, seconds left = 124000733
[2018-07-16 17:38:04.812465] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96530.00 secs
[2018-07-16 17:38:04.812489] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36877, size: 12270261443, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:04.992413] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:04.994122] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9110012.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:06.855618] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108318798 tmp_cnt = 55419279917056,rate_processed=446570.244043, elapsed = 96532.000000
[2018-07-16 17:38:06.855719] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124099804 seconds, seconds left = 124003272
[2018-07-16 17:38:06.855770] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96532.00 secs
[2018-07-16 17:38:06.855793] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
[2018-07-16 17:38:08.511064] I [dht-rebalance.c:1645:dht_migrate_file] 0-data-dht: /this/is/a/file/path/that/exists/wz/wz/Npc.wz/9201055.img.xml: attempting to move from data-client-0 to data-client-2
[2018-07-16 17:38:08.533029] I [MSGID: 109022] [dht-rebalance.c:2274:dht_migrate_file] 0-data-dht: completed migration of /this/is/a/file/path/that/exists/wz/wz/Npc.wz/2050000.img.xml from subvolume data-client-0 to data-client-2
[2018-07-16 17:38:08.899708] I [dht-rebalance.c:4982:gf_defrag_get_estimates_based_on_size] 0-glusterfs: TIME: (size) total_processed=43108318798 tmp_cnt = 55419279917056,rate_processed=446560.991961, elapsed = 96534.000000
[2018-07-16 17:38:08.899791] I [dht-rebalance.c:5130:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 124102375 seconds, seconds left = 124005841
[2018-07-16 17:38:08.899842] I [MSGID: 109028] [dht-rebalance.c:5210:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 96534.00 secs
[2018-07-16 17:38:08.899865] I [MSGID: 109028] [dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files migrated: 36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
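The TIME lines in this log carry enough information to reproduce the estimate by hand. The sketch below assumes that tmp_cnt is the total number of bytes the estimator expects to process. That reading is inferred from the numbers (they fit), not taken from the Gluster source, and `estimate_from_log` is a hypothetical helper.

```shell
# estimate_from_log TOTAL_PROCESSED_BYTES TMP_CNT_BYTES ELAPSED_SECONDS
# Recomputes the size-based estimate from one TIME log line:
#   rate  = processed / elapsed
#   total = tmp_cnt / rate   (truncated to whole seconds)
#   left  = total - elapsed
estimate_from_log() {
    LC_ALL=C awk -v processed="$1" -v total_bytes="$2" -v elapsed="$3" 'BEGIN {
        if (processed <= 0 || elapsed <= 0) exit 1
        rate = processed / elapsed
        est_total = int(total_bytes / rate)
        left = est_total - elapsed
        printf "rate=%.2f B/s total=%d s left=%d s (%.0f days)\n",
               rate, est_total, left, left / 86400
    }'
}
```

Running `estimate_from_log 43108305980 55419279917056 96526` prints `rate=446597.87 B/s total=124092127 s left=123995601 s (1435 days)`, which matches the first pair of TIME lines. At under 0.5 MB/s, the size-based estimate comes out at close to four years.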
On Mon, Jul 16, 2018 at 7:37 AM, Nithya Balachandran <
Post by Nithya Balachandran
If possible, please send the rebalance logs as well.
Post by Nithya Balachandran
Hi Rusty,
We need the following information:
1. The exact gluster version you are running
2. gluster volume info <volname>
3. gluster rebalance status
4. Information on the directory structure and file locations
on your volume.
5. How many levels of directories
6. How many files and directories in each level
7. How many directories and files in total (a rough estimate)
8. Average file size
Please note that having a rebalance running in the background
should not affect your volume access in any way. However I would like to
know why only 6000 files have been scanned in 6 hours.
Regards,
Nithya
Post by Rusty Bower
Hey folks,
I just added a new brick to my existing gluster volume, but *gluster
volume rebalance data status* is telling me the
following: Estimated time left for rebalance to complete : > 2 months.
Please try again later.
I already did a fix-mapping, but this thing is absolutely
crawling trying to rebalance everything (last estimate was ~40 years)
Any thoughts on if this is a bug, or ways to speed this up? It's
taking ~6 hours to scan 6000 files, which seems unreasonably slow.
Thanks
Rusty
Rusty Bower
2018-07-31 16:47:56 UTC
Is it possible to pause the rebalance to get those numbers? I'm hesitant to
stop the rebalance and have to redo the entire thing again.
Post by Nithya Balachandran
Post by Rusty Bower
I'll figure out what hasn't been rebalanced yet and run the script.
There's only a single client accessing this gluster volume, and while the
rebalance is taking place, I am only able to read/write to the volume
at around 3MB/s. If I log onto one of the bricks, I can read/write to the
physical volumes at speeds greater than 100MB/s (which is what I would
expect).
What are the numbers when accessing the volume when rebalance is not
running?
Regards,
Nithya
Nithya Balachandran
2018-08-01 08:17:31 UTC
Post by Rusty Bower
Is it possible to pause the rebalance to get those numbers? I'm hesitant to
stop the rebalance and have to redo the entire thing again.
I'm afraid not. Rebalance will start from the beginning if you do so.
Post by Nithya Balachandran
Post by Rusty Bower
I'll figure out what hasn't been rebalanced yet and run the script.
There's only a single client accessing this gluster volume, and while
the rebalance is taking place, I am only able to read/write to the
volume at around 3MB/s. If I log onto one of the bricks, I can read/write
to the physical volumes at speeds greater than 100MB/s (which is what I
would expect).
What are the numbers when accessing the volume when rebalance is not
running?
Regards,
Nithya
[2018-07-16 17:38:08.899865] I [MSGID: 109028]
[dht-rebalance.c:5214:gf_defrag_status_get] 0-glusterfs: Files
migrated: 36879, size: 12270266602, lookups: 50715, failures: 0, skipped: 0
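
For anyone trying to read the TIME lines above, the numbers seem to follow simple arithmetic. The sketch below is not taken from the gluster source. It assumes that tmp_cnt is the total number of bytes the rebalance expects to process and that the estimate is truncated to whole seconds. The function name estimate is made up for illustration. The input values are copied from the 17:38:04.812 entries.

```python
# Values copied from the 17:38:04.812 log entries above.
total_processed = 43108308134   # bytes processed so far
tmp_cnt = 55419279917056        # ASSUMPTION: total bytes the rebalance expects to process
elapsed = 96530.0               # seconds since the rebalance started


def estimate(total_processed, tmp_cnt, elapsed):
    """Recompute the rate and time estimates the way the log numbers suggest.

    Hypothetical helper, not a gluster API.
    """
    rate = total_processed / elapsed        # bytes per second
    total_secs = int(tmp_cnt / rate)        # ASSUMPTION: truncated to whole seconds
    secs_left = total_secs - int(elapsed)   # remaining = total - already spent
    return rate, total_secs, secs_left


rate, total_secs, secs_left = estimate(total_processed, tmp_cnt, elapsed)
print(f"rate_processed  = {rate:.6f} bytes/s")
print(f"estimated total = {total_secs} s")
print(f"seconds left    = {secs_left} s")
print(f"days left       = {secs_left / 86400:.1f}")
```

With those inputs the sketch reproduces the logged values (estimated total 124097263 seconds, 124000733 seconds left), which is roughly 1435 days. So the long estimate looks like a direct consequence of the roughly 447 KB/s processing rate.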
On Mon, Jul 16, 2018 at 7:37 AM, Nithya Balachandran wrote:
Post by Nithya Balachandran
If possible, please send the rebalance logs as well.
On 16 July 2018 at 10:14, Nithya Balachandran wrote:
Post by Nithya Balachandran
Hi Rusty,
1. The exact gluster version you are running
2. gluster volume info <volname>
3. gluster rebalance status
4. Information on the directory structure and file
locations on your volume.
5. How many levels of directories
6. How many files and directories in each level
7. How many directories and files in total (a rough estimate)
8. Average file size
Please note that having a rebalance running in the background
should not affect your volume access in any way. However I would like to
know why only 6000 files have been scanned in 6 hours.
Regards,
Nithya
Post by Rusty Bower
Hey folks,
I just added a new brick to my existing gluster volume, but *gluster
volume rebalance data status* is telling me the
following: Estimated time left for rebalance to complete : > 2 months.
Please try again later.
I already did a fix-mapping, but this thing is absolutely
crawling trying to rebalance everything (last estimate was ~40 years)
Any thoughts on if this is a bug, or ways to speed this up?
It's taking ~6 hours to scan 6000 files, which seems unreasonably slow.
Thanks
Rusty