Re: Random data corruption in VM, possibly caused by rbd

2012-06-15 Thread Stefan Majer
Hi, Today we had a catastrophic fs corruption in one of our virtual machines; after fsck, ~100MB was inside lost+found :-( So I think we hit the same bug (ceph-0.45.2, sparse rbd images). Is there any progress on this topic? Any hint on how to help with this would be appreciated. Greetings, Stefan

Re: Random data corruption in VM, possibly caused by rbd

2012-06-15 Thread Josh Durgin
Short version: you should set 'filestore fiemap = false' for your osds. I was able to reproduce the crash with all the debugging I needed yesterday via test_librbd_fsx, and the problem looks like a bug in fiemap. Even though we call fsync before each fiemap call, we were getting different
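A minimal sketch of checking for the workaround above, assuming the option lives in the [osd] section of a ceph.conf-style file. Only the 'filestore fiemap = false' line comes from the thread; the sample text and the helper name fiemap_disabled are hypothetical, and Python's configparser is used here only as an approximation of Ceph's own config parsing.

```python
import configparser

# Hypothetical ceph.conf excerpt. Only the 'filestore fiemap' line is taken
# from the thread; the rest of a real file is omitted.
SAMPLE_CONF = """
[osd]
filestore fiemap = false
"""


def fiemap_disabled(conf_text: str) -> bool:
    """Return True if the [osd] section explicitly sets filestore fiemap to false."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    if not parser.has_section("osd"):
        return False
    # Ceph option names may be written with spaces or underscores,
    # so normalize underscores to spaces before looking the key up.
    options = {key.replace("_", " "): value for key, value in parser.items("osd")}
    # Assumption: this sketch only recognizes the literal spelling 'false'.
    return options.get("filestore fiemap", "").strip().lower() == "false"


if __name__ == "__main__":
    print(fiemap_disabled(SAMPLE_CONF))
```

The check only reads the text it is given; it does not say whether a running osd has picked the setting up.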

Re: Random data corruption in VM, possibly caused by rbd

2012-06-15 Thread Josh Durgin
Since Guido was seeing this problem on btrfs as well, I'm going to try tracking down more precisely where it was introduced. Josh On 06/15/2012 08:38 AM, Josh Durgin wrote: Short version: you should set 'filestore fiemap = false' for your osds. I was able to reproduce the crash with all the

Re: Random data corruption in VM, possibly caused by rbd

2012-06-12 Thread Guido Winkelmann
On Monday, 11 June 2012, 09:30:42, Sage Weil wrote: If you can reproduce it with 'debug filestore = 20' too, that will be better, as it will tell us what the FIEMAP ioctl is returning. I ran another test run with 'debug filestore = 20'. Also, if you can attach/post the contents of the

Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Guido Winkelmann
On Saturday, 9 June 2012, 20:04:20, Sage Weil wrote: On Fri, 8 Jun 2012, Guido Winkelmann wrote: On Friday, 8 June 2012, 07:50:36, Josh Durgin wrote: On 06/08/2012 06:55 AM, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something

Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Guido Winkelmann
On Friday, 8 June 2012, 06:55:19, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's. Freshly imported from running *.qcow2 images. Kernel panic with INIT, seg-faults and other

Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Sage Weil
On Mon, 11 Jun 2012, Guido Winkelmann wrote: On Friday, 8 June 2012, 06:55:19, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's. Freshly imported from running *.qcow2 images.

Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Guido Winkelmann
On Monday, 11 June 2012, 09:30:42, Sage Weil wrote: On Mon, 11 Jun 2012, Guido Winkelmann wrote: On Friday, 8 June 2012, 06:55:19, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Are you guys able to reproduce the corruption with 'debug osd = 20' and 'debug ms =

Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Sage Weil
On Mon, 11 Jun 2012, Guido Winkelmann wrote: On Monday, 11 June 2012, 09:30:42, Sage Weil wrote: On Mon, 11 Jun 2012, Guido Winkelmann wrote: On Friday, 8 June 2012, 06:55:19, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Are you guys able to reproduce the

Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Josh Durgin
On 06/11/2012 10:07 AM, Guido Winkelmann wrote: On Monday, 11 June 2012, 09:30:42, Sage Weil wrote: On Mon, 11 Jun 2012, Guido Winkelmann wrote: On Friday, 8 June 2012, 06:55:19, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Are you guys able to reproduce the corruption

Re: Random data corruption in VM, possibly caused by rbd

2012-06-09 Thread Sage Weil
On Fri, 8 Jun 2012, Guido Winkelmann wrote: On Friday, 8 June 2012, 07:50:36, Josh Durgin wrote: On 06/08/2012 06:55 AM, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's.

Re: Random data corruption in VM, possibly caused by rbd

2012-06-09 Thread Sage Weil
On Sat, 9 Jun 2012, Sage Weil wrote: On Fri, 8 Jun 2012, Guido Winkelmann wrote: On Friday, 8 June 2012, 07:50:36, Josh Durgin wrote: On 06/08/2012 06:55 AM, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something weird going on. I

Re: Random data corruption in VM, possibly caused by rbd

2012-06-08 Thread Guido Winkelmann
On Thursday, 7 June 2012, 12:48:05, you wrote: On 06/07/2012 11:04 AM, Guido Winkelmann wrote: Hi, I'm using Ceph with RBD to provide network-transparent disk images for KVM-based virtual servers. The last two days, I've been hunting some weird elusive bug where data in the

Re: Random data corruption in VM, possibly caused by rbd

2012-06-08 Thread Guido Winkelmann
On Friday, 8 June 2012, 14:55:44, Guido Winkelmann wrote: I did not change anything else in the setup. In particular, the OSDs still use btrfs. One of the OSDs has been restarted, though. I will run another test with a VM without rbd caching, to make sure it wasn't by random chance

Re: Random data corruption in VM, possibly caused by rbd

2012-06-08 Thread Oliver Francke
Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's. Freshly imported from running *.qcow2 images. Kernel panic with INIT, seg-faults and other funny stuff. Just added the rbd_cache=true in my config, voila. All is fast-n-up-n-running... All my

Re: Random data corruption in VM, possibly caused by rbd

2012-06-08 Thread Sage Weil
On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's. Freshly imported from running *.qcow2 images. Kernel panic with INIT, seg-faults and other funny stuff. Just added the rbd_cache=true in my config,

Re: Random data corruption in VM, possibly caused by rbd

2012-06-08 Thread Josh Durgin
On 06/08/2012 06:55 AM, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's. Freshly imported from running *.qcow2 images. Kernel panic with INIT, seg-faults and other funny stuff. Just added

Re: Random data corruption in VM, possibly caused by rbd

2012-06-08 Thread Oliver Francke
Well then, quite busy, too with some other stuff, but... On 06/08/2012 04:50 PM, Josh Durgin wrote: On 06/08/2012 06:55 AM, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's. Freshly

Re: Random data corruption in VM, possibly caused by rbd

2012-06-08 Thread Guido Winkelmann
On Friday, 8 June 2012, 07:50:36, Josh Durgin wrote: On 06/08/2012 06:55 AM, Sage Weil wrote: On Fri, 8 Jun 2012, Oliver Francke wrote: Hi Guido, yeah, there is something weird going on. I just started to establish some test-VM's. Freshly imported from running *.qcow2 images.

Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Guido Winkelmann
Hi, I'm using Ceph with RBD to provide network-transparent disk images for KVM-based virtual servers. The last two days, I've been hunting some weird elusive bug where data in the virtual machines would be corrupted in weird ways. It usually manifests in files having some random data - usually

Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Guido Winkelmann
On Thursday, 7 June 2012, 20:18:52, Stefan Priebe wrote: I think the test script would help a lot so others can test too. Okay, I've attached the program. It's barely 2 KB. You need Boost 1.45+, CMake 2.6+ and Crypto++ to compile it. Warning: This will fill up your hard disk completely,

Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Oliver Francke
Hi Guido, unfortunately this sounds very familiar to me. We have been on a long road with similar weird errors. Our setup is something like this: start a couple of VMs (qemu-*), let each of them create a 1G file and randomly seek and write 4MB blocks filled with md5sums of the block as payload, to be
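The write-and-verify test described above can be sketched roughly as follows. This is not Oliver's actual script, which is not shown in the thread; the payload layout (an md5 hex digest of "block index:generation" repeated to fill the block), all function names, and the small sizes used in demo() are assumptions made for illustration.

```python
import hashlib
import os
import random
import tempfile


def make_payload(block_index: int, generation: int, block_size: int) -> bytes:
    """Fill one block with a repeated md5 hex digest (assumed payload layout)."""
    digest = hashlib.md5(f"{block_index}:{generation}".encode()).hexdigest().encode()
    repeats = block_size // len(digest) + 1
    return (digest * repeats)[:block_size]


def write_blocks(path, file_size, block_size, writes, seed=0):
    """Create a sparse file, write random blocks, return {block index: md5 of block}."""
    rng = random.Random(seed)
    block_count = file_size // block_size
    expected = {}
    with open(path, "wb") as f:
        f.truncate(file_size)  # sparse file of the requested size
    with open(path, "r+b") as f:
        for generation in range(writes):
            index = rng.randrange(block_count)
            payload = make_payload(index, generation, block_size)
            f.seek(index * block_size)
            f.write(payload)
            # A later write to the same block replaces the expected checksum.
            expected[index] = hashlib.md5(payload).hexdigest()
        f.flush()
        os.fsync(f.fileno())
    return expected


def verify(path, expected, block_size):
    """Return the sorted list of block indices whose content no longer matches."""
    bad = []
    with open(path, "rb") as f:
        for index in sorted(expected):
            f.seek(index * block_size)
            if hashlib.md5(f.read(block_size)).hexdigest() != expected[index]:
                bad.append(index)
    return bad


def demo():
    """Small self-contained run: 64 KiB file, 4 KiB blocks, 20 random writes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "iotest.bin")
        expected = write_blocks(path, 64 * 1024, 4096, writes=20, seed=1)
        return verify(path, expected, 4096)


if __name__ == "__main__":
    print(demo())
```

For a run closer to the description, the sizes would be 1 GB and 4 MB. Reading back immediately inside the same guest may be served from the page cache, so a meaningful check would probably need to happen after a guest reboot or with caches dropped.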

Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Josh Durgin
On 06/07/2012 11:04 AM, Guido Winkelmann wrote: Hi, I'm using Ceph with RBD to provide network-transparent disk images for KVM-based virtual servers. The last two days, I've been hunting some weird elusive bug where data in the virtual machines would be corrupted in weird ways. It usually

Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Andrey Korolyov
Hmm, can't reproduce that (phew!). Qemu-1.1-release, 0.47.2, guest/host mainly debian wheezy. The one main difference between my setup and yours is the underlying fs - I'm tired of btrfs's unpredictable load issues and moved back to xfs. BTW you calculate sha1 in test suite, not sha256 as you mentioned

Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Guido Winkelmann
On Thursday 07 June 2012 23:54:04 Andrey Korolyov wrote: Hmm, can't reproduce that (phew!). Qemu-1.1-release, 0.47.2, guest/host mainly debian wheezy. The one main difference between my setup and yours is the underlying fs - I'm tired of btrfs's unpredictable load issues and moved back to xfs. I

Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Guido Winkelmann
On Thursday 07 June 2012 12:48:05 Josh Durgin wrote: On 06/07/2012 11:04 AM, Guido Winkelmann wrote: Hi, I'm using Ceph with RBD to provide network-transparent disk images for KVM-based virtual servers. The last two days, I've been hunting some weird elusive bug where data in the

Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Marcus Sorensen
Maybe I did something wrong with your iotester, but I had to mkdir ./iotest to get it to run. I straced and found that it died on 'no such file'. On Thu, Jun 7, 2012 at 12:37 PM, Guido Winkelmann guido-c...@thisisnotatest.de wrote: On Thursday, 7 June 2012, 20:18:52, Stefan Priebe wrote: I

Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Guido Winkelmann
On Thursday 07 June 2012 15:53:18 Marcus Sorensen wrote: Maybe I did something wrong with your iotester, but I had to mkdir ./iotest to get it to run. I straced and found that it died on 'no such file'. It's a bit quick and dirty... You are supposed to pass the directory where it is to put

Re: Random data corruption in VM, possibly caused by rbd

2012-06-07 Thread Tommi Virtanen
On Thu, Jun 7, 2012 at 2:36 PM, Guido Winkelmann guido-c...@thisisnotatest.de wrote: Again, I'll try that tomorrow. BTW, I could use some advice on how to go about that. Right I would stop one osd process (not the whole machine), reformat and remount its btrfs devices as XFS, delete the