Discussion:
File Corruption with shards - 100% reproducible
Lindsay Mathieson
2015-11-04 22:23:25 UTC
Permalink
Gluster 3.7.5, gluster repos, on proxmox (debian 8)

I have an issue with VM images (qcow2) being corrupted.

- gluster replica 3, shards on, shard size = 256MB
- Gluster nodes are all also VM host nodes
- VM image mounted from qemu via gfapi
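
(The sharding options above were set with roughly the following; this is a
sketch using the volume name that appears later in the thread:)

gluster volume set datastore1 features.shard on
gluster volume set datastore1 features.shard-block-size 256MB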

To reproduce
- Start VM
- live migrate it to another node
- VM will rapidly become unresponsive and have to be stopped
- attempting to restart the vm results in a "qcow2: Image is corrupt;
cannot be opened read/write" error.

I have never seen this before. 100% reproducible with shards on, never
happens with shards off.

I don't think this happens when using NFS to access the sharded volume; I
suspect that is because with NFS it is still accessing the one node, whereas with
gfapi the access is handed off to the node the VM is running on.
--
Lindsay
Krutika Dhananjay
2015-11-05 11:55:00 UTC
Permalink
Hi,

Although I do not have experience with VM live migration, IIUC, it has to do with a different server (and as a result a new glusterfs client process) taking over the operations and management of the VM.
If this is a correct assumption, then I think this could be the result of the same caching bug that I talked about sometime back in 3.7.5, which is fixed in 3.7.6.
The issue could cause the new client to not see the correct size and block count of the file, leading to errors in reads (perhaps triggered by the restart of the VM) and writes on the image.
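
(A rough way to check for this, assuming the image is also visible on a fuse mount, would be to compare what each hypervisor's client reports for the file; the path below is only a placeholder:)

# run on both the old and the new hypervisor and compare the
# Size: and Blocks: fields each glusterfs client reports
stat <fuse-mountpoint>/<vm-image>.qcow2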

-Krutika
Lindsay Mathieson
2015-11-05 13:47:35 UTC
Permalink
Although I do not have experience with VM live migration, IIUC, it has to do
with a different server (and as a result a new glusterfs client process)
taking over the operations and management of the VM.
That sounds very plausible
If this is a correct assumption, then I think this could be the result of
the same caching bug that I talked about sometime back in 3.7.5, which is
fixed in 3.7.6.
The issue could cause the new client to not see the correct size and block
count of the file, leading to errors in reads (perhaps triggered by the
restart of the vm) and writes on the image.
Cool, I look fwd to testing that in 3.7.6, which I believe is due out next
week?

thanks,
--
Lindsay
Krutika Dhananjay
2015-11-06 05:35:27 UTC
Permalink
CC'ing Raghavendra Talur, who is managing the 3.7.6 release.

-Krutika

Krutika Dhananjay
2015-11-06 05:35:56 UTC
Permalink
CC'd him only now.

Krutika Dhananjay
2015-11-14 07:30:57 UTC
Permalink
You should be able to find a file named group-virt.example under /etc/glusterfs/.
Copy that to /var/lib/glusterd/groups/virt (the path in the error message below).

Then execute `gluster volume set datastore1 group virt`.
Now with this configuration, could you try your test case and let me know whether the file corruption still exists?
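
(Roughly, a sketch of the steps on the node where you run the command, with the destination path taken from the error message in your mail below:)

mkdir -p /var/lib/glusterd/groups   # in case the directory does not exist yet
cp /etc/glusterfs/group-virt.example /var/lib/glusterd/groups/virt
gluster volume set datastore1 group virt
gluster volume info datastore1      # the group's options should now be listed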

-Krutika

----- Original Message -----
Sent: Saturday, November 14, 2015 10:51:26 AM
Subject: RE: [Gluster-users] File Corruption with shards - 100% reproducible
gluster volume set datastore1 group virt
Unable to open file '/var/lib/glusterd/groups/virt'. Error: No such file or
directory
Not sure I understand this one – couldn’t find any docs for it.
Sent from Mail for Windows 10
From: Krutika Dhananjay
Sent: Saturday, 14 November 2015 1:45 PM
To: Lindsay Mathieson
Cc: gluster-users
Subject: Re: [Gluster-users] File Corruption with shards - 100% reproducible
The logs are at /var/log/glusterfs/<hyphenated-path-to-the-mountpoint>.log
OK. So what do you observe when you set group virt to on?
# gluster volume set <VOL> group virt
-Krutika
Sent: Friday, November 13, 2015 11:57:15 AM
Subject: Re: [Gluster-users] File Corruption with shards - 100% reproducible
OK. What do the client logs say?
Dumb question - Which logs are those?
Could you share the exact steps to recreate this, and I will try it
locally
on my setup?
I'm running this on a 3 node proxmox cluster, which makes the vm creation &
migration easy to test.
- Create 3 node gluster datastore using proxmox vm host nodes
- Add gluster datastore as a storage device to proxmox
* qemu vms use the gfapi to access the datastore
* proxmox also adds a fuse mount for easy access
- create a VM on the gluster storage, QCOW2 format. I just created a simple
Debian MATE vm
- start the vm, open a console to it.
- live migrate the VM to another node
- It will rapidly barf itself with disk errors
- stop the VM
- qemu will show file corruption (many many errors)
* qemu-img check <vm disk image>
* qemu-img info <vm disk image>
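
(e.g. something like the following against the fuse mount; the mount point is a placeholder, the image name is the one from my setup:)

qemu-img check <fuse-mountpoint>/images/910/vm-910-disk-1.qcow2
qemu-img info <fuse-mountpoint>/images/910/vm-910-disk-1.qcow2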
Repeating the process with sharding off has no errors.
Also, want to see the output of 'gluster volume info'.
I've trimmed settings down to a bare minimum. This is a test gluster
cluster
so I can do with it as I wish.
gluster volume info
Volume Name: datastore1
Type: Replicate
Volume ID: 238fddd0-a88c-4edb-8ac5-ef87c58682bf
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Brick1: vnb.proxmox.softlog:/mnt/ext4
Brick2: vng.proxmox.softlog:/mnt/ext4
Brick3: vna.proxmox.softlog:/mnt/ext4
performance.strict-write-ordering: on
performance.readdir-ahead: off
cluster.quorum-type: auto
features.shard: on
--
Lindsay
Lindsay Mathieson
2015-11-15 06:09:57 UTC
Permalink
So to start with, just disable performance.stat-prefetch and leave the
rest of the options as they were before and run the test case.
Yes, that seems to be the guilty party. When it's disabled I can freely migrate
VMs; enabled, things rapidly go pear shaped.
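
(For reference, disabling it amounts to something like:)

gluster volume set datastore1 performance.stat-prefetch off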
--
Lindsay
Krutika Dhananjay
2015-11-19 04:49:14 UTC
Permalink
Lindsay,

I wanted to ask you one more thing: specifically in VM workload with sharding, do you run into consistency issues with strict-write-ordering being off?
I remember suggesting that this option be enabled. But that was for plain dd on the mountpoint (and not inside the vm), where it was necessary.
I want to know if it is *really* necessary in VM workloads.

-Krutika

Lindsay Mathieson
2015-11-20 05:15:14 UTC
Permalink
Post by Krutika Dhananjay
I wanted to ask you one more thing: specifically in VM workload with
sharding, do you run into consistency issues with
strict-write-ordering being off?
I remember suggesting that this option be enabled. But that was for
plain dd on the mountpoint (and not inside the vm), where it was
necessary.
I want to know if it is *really* necessary in VM workloads.
Hi Krutika, sorry for the delay, have been head down with work and sick
doggies :(

No, I didn't need to touch strict-write-ordering either way; the VMs were fine. It
was only stat-prefetch that needed to be off.

One caveat - I started testing with 3.7.5, then upgraded to 3.7.6, but
didn't upgrade the op-version (always forget that).

Once I set the op version to 3.7.6 sharded volumes started reporting
correct file sizes (for new files) even with strict-write-ordering off.
However disk usage was still out by a lot.
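
(For anyone else who forgets: after upgrading all nodes, the op-version is bumped with something like the following, run once from any node; the exact number corresponding to 3.7.6 is left as a placeholder here:)

gluster volume set all cluster.op-version <op-version-for-3.7.6>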
Lindsay Mathieson
2015-11-20 06:26:04 UTC
Permalink
Post by Lindsay Mathieson
One caveat - I started testing with 3.7.5, then upgraded to 3.7.6, but
didn't upgrade the op-version (always forget that).
Once I set the op version to 3.7.6 sharded volumes started reporting
correct file sizes (for new files) even with strict-write-ordering
off. However disk usage was still out by a lot.
Ignore that, I just retested and file sizes (ls -l) were wildly out.

However the VM still migrates between nodes with strict-write-ordering off, no
problems.

My apologies for the confusion.
Krutika Dhananjay
2015-11-23 09:44:46 UTC
Permalink
Thanks Lindsay for the confirmation.

The patch http://review.gluster.org/#/c/12717/ might just be the fix to the issue you ran into with performance.stat-prefetch on.
With this patch, it should be possible to enable stat-prefetch without running into any problems.

-Krutika

Lindsay Mathieson
2015-11-23 11:07:12 UTC
Permalink
Thanks Krutika, I’ll test that asap, but it will probably take me a day or two to get set up.

Should I apply the patch against the 3.7.6 tag, or is there a branch with it?

Sent from Mail for Windows 10
Krutika Dhananjay
2015-11-23 11:11:26 UTC
Permalink
The patch should make it into 3.7.7.

-Krutika
Lindsay Mathieson
2015-11-23 11:46:19 UTC
Permalink
Excellent, thanks. I imagine the clients would have to be rebuilt as well, which I can’t really do on my servers.

Sent from Mail for Windows 10
Lindsay Mathieson
2015-12-17 02:04:45 UTC
Permalink
Post by Krutika Dhananjay
The patch http://review.gluster.org/#/c/12717/ might just be the fix
to the issue you ran into with performance.stat-prefetch on.
With this patch, it should be possible to enable stat-prefetch without
running into any problems.
Is this in 3.7.6? Because live migrations still cause corruption with
performance.stat-prefetch on :)
--
Lindsay Mathieson
Pranith Kumar Karampuri
2015-12-17 03:10:10 UTC
Permalink
Post by Lindsay Mathieson
Post by Krutika Dhananjay
The patch http://review.gluster.org/#/c/12717/ might just be the fix
to the issue you ran into with performance.stat-prefetch on.
With this patch, it should be possible to enable stat-prefetch
without running into any problems.
Is this in 3.7.6? because live migrations still cause corruption with
performance.stat-prefetch on :)
Hi Lindsay,
I see that this particular patch was merged into the 3.7 branch after 3.7.6. You
should have it in 3.7.7.

Pranith
Lindsay Mathieson
2015-12-17 03:19:09 UTC
Permalink
Post by Pranith Kumar Karampuri
Hi Lindsay,
I see that this particular patch was merged into the 3.7 branch after 3.7.6. You
should have it in 3.7.7.
Pranith
Thanks
--
Lindsay Mathieson
Humble Devassy Chirammal
2015-11-13 10:01:33 UTC
Permalink
Hi Lindsay,
- start the vm, open a console to it.

- live migrate the VM to another node

- It will rapidly barf itself with disk errors
Can you please share which 'cache' option (none, writeback, writethrough, etc.)
has been set for I/O on this problematic VM? This can be fetched either from the
process output or from the XML schema of the VM.
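
(e.g. something along these lines on the hypervisor currently running the VM:)

# split the long kvm command line on the commas and pull out the cache= setting
ps -ef | grep '[k]vm' | tr ',' '\n' | grep cache=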

--Humble


Lindsay Mathieson
2015-11-13 11:02:55 UTC
Permalink
On 13 November 2015 at 20:41, Humble Devassy Chirammal <
If possible, can you please check the result with 'cache=none' ?
Corrupted with that too I'm afraid.
--
Lindsay
Lindsay Mathieson
2015-11-13 11:10:32 UTC
Permalink
The command used to launch the VM:

/usr/bin/kvm -id 910 -chardev
socket,id=qmp,path=/var/run/qemu-server/910.qmp,server,nowait -mon
chardev=qmp,mode=control -vnc
unix:/var/run/qemu-server/910.vnc,x509,password -pidfile
/var/run/qemu-server/910.pid -daemonize -smbios
type=1,uuid=f415789d-d92c-44ef-9bfc-44c448eff562 -name gluster-test -smp
2,sockets=1,cores=2,maxcpus=2 -nodefaults -boot
menu=on,strict=on,reboot-timeout=1000 -vga qxl -cpu
kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,-kvm_steal_time,enforce -m
2048 -k en-us -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e
-device pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f -device
piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -spice
tls-port=61006,addr=localhost,tls-ciphers=DES-CBC3-SHA,seamless-migration=on
-device virtio-serial,id=spice,bus=pci.0,addr=0x9 -chardev
spicevmc,id=vdagent,name=vdagent -device
virtserialport,chardev=vdagent,name=com.redhat.spice.0 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -iscsi
initiator-name=iqn.1993-08.org.debian:01:cb8a28cc6f1e -drive
if=none,id=drive-ide2,media=cdrom,aio=threads -device
ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200 -drive
file=gluster://vnb.proxmox.softlog/datastore1/images/910/vm-910-disk-1.qcow2,if=none,id=drive-virtio0,cache=none,format=qcow2,aio=native,detect-zeroes=on
-device
virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100
-netdev
type=tap,id=net0,ifname=tap910i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on
-device
virtio-net-pci,mac=66:39:35:35:34:65,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300
--
Lindsay
Lindsay Mathieson
2015-11-13 11:02:17 UTC
Permalink
On 13 November 2015 at 20:01, Humble Devassy Chirammal <
Post by Humble Devassy Chirammal
Can you please share which 'cache' option ( none, writeback,
writethrough..etc) has been set for I/O on this problematic VM ? This
can be fetched either from process output or from xml schema of the VM.
I tried it with Cache off and Cache = Sync. Both times the image was
corrupted.
--
Lindsay
Lindsay Mathieson
2015-11-14 10:09:46 UTC
Permalink
The logs are at /var/log/glusterfs/<hyphenated-path-to-the-mountpoint>.log
Attached are the logs for node vnb & vna.

I started the VM on vnb and migrated it to vng
vnb => vng

NB: The actual access of the VM image is not done via the fuse mount, but
using gfapi directly, so I'm not sure which log is relevant for that. Have
included them all.
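
(e.g. for a hypothetical fuse mount at /mnt/pve/datastore1, that naming convention would give

/var/log/glusterfs/mnt-pve-datastore1.log

but the gfapi access wouldn't be writing there.)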

Thanks.
--
Lindsay