Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-14 Thread Dr. David Alan Gilbert
* Stefan Priebe (s.pri...@profihost.ag) wrote:
 
 On 13.02.2014 21:06, Dr. David Alan Gilbert wrote:
 * Stefan Priebe (s.pri...@profihost.ag) wrote:
 On 10.02.2014 17:07, Dr. David Alan Gilbert wrote:
 * Stefan Priebe (s.pri...@profihost.ag) wrote:
 I could fix it by explicitly disabling xbzrle - it seems it's
 automatically on if I do not set the migration caps to false.
 
 So it seems to be an xbzrle bug.
 
 Stefan, can you give me some more info on your hardware and
 migration setup? That stressapptest (which is a really nice
 find!) really batters the memory, and it means the migration
 isn't converging for me, so I'm curious what your setup is.
 
 That one was developed by Google and has been known to me for a few
 years. Google found that memtest and co. are not good enough to
 stress-test memory.
 
 Hi Stefan,
I've just posted a patch to qemu-devel that fixes two bugs that
 we found; I've only tried a small stressapptest run and it seems
 to survive with them (where it didn't before);  you might like to try
 it if you're up for rebuilding qemu.
 
 It's the one entitled '[PATCH] Fix two XBZRLE corruption issues'
 
 I'll try and get a larger run done myself, but I'd be interested to
 hear if it fixes it for you (or anyone else who hit the problem).
 
 Yes, works fine - no crash now, but it's slower than without XBZRLE ;-)
 
 Without XBZRLE: I needed migrate_downtime 4 and it took around 60s
 With XBZRLE: I needed migrate_downtime 16 and it took 240s

Hmm; how did that compare with the previous (broken) XBZRLE
time? (i.e. was XBZRLE always slower for you?)

If you're driving this from the hmp/command interface then
the result of the
  info migrate

command at the end of each of those runs would be interesting.
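
For example (assuming the HMP monitor; the exact output fields vary by
QEMU version):

  (qemu) info migrate_capabilities
  (qemu) info migrate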

Another thing you could try is changing the xbzrle_cache_zero_page
in arch_init.c that I added so it reads as:

static void xbzrle_cache_zero_page(ram_addr_t current_addr)
{
    if (ram_bulk_stage || !migrate_use_xbzrle()) {
        return;
    }

    if (!cache_is_cached(XBZRLE.cache, current_addr)) {
        return;
    }

    /* We don't care if this fails to allocate a new cache page
     * as long as it updated an old one */
    cache_insert(XBZRLE.cache, current_addr, ZERO_TARGET_PAGE);
}

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-13 Thread Dr. David Alan Gilbert
* Stefan Priebe (s.pri...@profihost.ag) wrote:
 On 10.02.2014 17:07, Dr. David Alan Gilbert wrote:
 * Stefan Priebe (s.pri...@profihost.ag) wrote:
 I could fix it by explicitly disabling xbzrle - it seems it's
 automatically on if I do not set the migration caps to false.
 
 So it seems to be an xbzrle bug.
 
 Stefan, can you give me some more info on your hardware and
 migration setup? That stressapptest (which is a really nice
 find!) really batters the memory, and it means the migration
 isn't converging for me, so I'm curious what your setup is.
 
 That one was developed by Google and has been known to me for a few
 years. Google found that memtest and co. are not good enough to
 stress-test memory.

Hi Stefan,
  I've just posted a patch to qemu-devel that fixes two bugs that
we found; I've only tried a small stressapptest run and it seems
to survive with them (where it didn't before);  you might like to try
it if you're up for rebuilding qemu.

It's the one entitled '[PATCH] Fix two XBZRLE corruption issues'

I'll try and get a larger run done myself, but I'd be interested to
hear if it fixes it for you (or anyone else who hit the problem).

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-13 Thread Stefan Priebe


On 13.02.2014 21:06, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

On 10.02.2014 17:07, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

I could fix it by explicitly disabling xbzrle - it seems it's
automatically on if I do not set the migration caps to false.

So it seems to be an xbzrle bug.


Stefan, can you give me some more info on your hardware and
migration setup? That stressapptest (which is a really nice
find!) really batters the memory, and it means the migration
isn't converging for me, so I'm curious what your setup is.


That one was developed by Google and has been known to me for a few
years. Google found that memtest and co. are not good enough to
stress-test memory.


Hi Stefan,
   I've just posted a patch to qemu-devel that fixes two bugs that
we found; I've only tried a small stressapptest run and it seems
to survive with them (where it didn't before);  you might like to try
it if you're up for rebuilding qemu.

It's the one entitled '[PATCH] Fix two XBZRLE corruption issues'


Thanks!

I'd really love to try them, but neither Google nor I can find them.

http://osdir.com/ml/qemu-devel/2014-02/

Stefan



I'll try and get a larger run done myself, but I'd be interested to
hear if it fixes it for you (or anyone else who hit the problem).

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK





Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-13 Thread Stefan Priebe

got it here:
http://lists.nongnu.org/archive/html/qemu-devel/2014-02/msg02341.html

will try asap

On 13.02.2014 21:06, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

On 10.02.2014 17:07, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

I could fix it by explicitly disabling xbzrle - it seems it's
automatically on if I do not set the migration caps to false.

So it seems to be an xbzrle bug.


Stefan, can you give me some more info on your hardware and
migration setup? That stressapptest (which is a really nice
find!) really batters the memory, and it means the migration
isn't converging for me, so I'm curious what your setup is.


That one was developed by Google and has been known to me for a few
years. Google found that memtest and co. are not good enough to
stress-test memory.


Hi Stefan,
   I've just posted a patch to qemu-devel that fixes two bugs that
we found; I've only tried a small stressapptest run and it seems
to survive with them (where it didn't before);  you might like to try
it if you're up for rebuilding qemu.

It's the one entitled '[PATCH] Fix two XBZRLE corruption issues'

I'll try and get a larger run done myself, but I'd be interested to
hear if it fixes it for you (or anyone else who hit the problem).

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK





Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-13 Thread Stefan Priebe


On 13.02.2014 21:06, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

On 10.02.2014 17:07, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

I could fix it by explicitly disabling xbzrle - it seems it's
automatically on if I do not set the migration caps to false.

So it seems to be an xbzrle bug.


Stefan, can you give me some more info on your hardware and
migration setup? That stressapptest (which is a really nice
find!) really batters the memory, and it means the migration
isn't converging for me, so I'm curious what your setup is.


That one was developed by Google and has been known to me for a few
years. Google found that memtest and co. are not good enough to
stress-test memory.


Hi Stefan,
   I've just posted a patch to qemu-devel that fixes two bugs that
we found; I've only tried a small stressapptest run and it seems
to survive with them (where it didn't before);  you might like to try
it if you're up for rebuilding qemu.

It's the one entitled '[PATCH] Fix two XBZRLE corruption issues'

I'll try and get a larger run done myself, but I'd be interested to
hear if it fixes it for you (or anyone else who hit the problem).


Yes, works fine - no crash now, but it's slower than without XBZRLE ;-)

Without XBZRLE: I needed migrate_downtime 4 and it took around 60s
With XBZRLE: I needed migrate_downtime 16 and it took 240s



Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK





Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-11 Thread Orit Wasserman

On 02/08/2014 09:23 PM, Stefan Priebe wrote:

I could fix it by explicitly disabling xbzrle - it seems it's automatically on if
I do not set the migration caps to false.

So it seems to be an xbzrle bug.



XBZRLE is disabled by default (actually all capabilities are off by default).
What version of QEMU are you using that you need to disable it explicitly?
Maybe you ran a migration with XBZRLE and canceled it, so it stays on?
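
For example, the capability state can be inspected and cleared over QMP
(a minimal sketch; standard QMP migration commands):

{ "execute": "query-migrate-capabilities" }
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "xbzrle", "state": false } ] } }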

Orit


Stefan

On 07.02.2014 21:10, Stefan Priebe wrote:

On 07.02.2014 21:02, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

Anything I could try or debug to help find the problem?


I think the most useful thing would be to see if the problem is
new in the 1.7 you're using or has existed
for a while; depending on the machine type you used, it might
be possible to load that image on an earlier (or newer) qemu
and try the same test. However, if the problem doesn't
repeat reliably, that can be hard.


I first saw this with QEMU 1.5 but was not able to reproduce it for
months. 1.4 was working fine.


If you have any way of simplifying the configuration of the
VM it would be good; e.g. if you could get a failure on
something without graphics (-nographic) and USB.


Sadly not ;-(


Dave



Stefan

Am 07.02.2014 14:45, schrieb Stefan Priebe - Profihost AG:

It's always the same pattern: there are too many 0s where the expected values should be.

only seen:

read:0x ... expected:0x

or

read:0x ... expected:0x

or

read:0xbf00bf00 ... expected:0xbfffbfff

or

read:0x ... expected:0xb5b5b5b5b5b5b5b5

no idea if this helps.

Stefan

Am 07.02.2014 14:39, schrieb Stefan Priebe - Profihost AG:

Hi,
On 07.02.2014 14:19, Paolo Bonzini wrote:

On 07/02/2014 14:04, Stefan Priebe - Profihost AG wrote:

first of all i've now a memory image of a VM where i can
reproduce it.


You mean you start that VM with -incoming 'exec:cat /path/to/vm.img'?
But the google stress test doesn't report any error until you start
the migration _and_ it finishes?


Sorry, no, I meant I have a VM where I saved the memory to disk - so I
don't need to wait hours until I can reproduce it, as it does not happen
with a freshly started VM. So it's a state file, I think.
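
(For reference, a minimal sketch of that save/restore flow, assuming the
HMP monitor and the same guest configuration on restore:

  (qemu) stop
  (qemu) migrate "exec:cat > /path/to/vm.img"

then start a fresh QEMU with the original command line plus
-incoming "exec:cat /path/to/vm.img".)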


Another test:

- start the VM with -S, migrate, do errors appear on the destination?


I started with -S and the errors appear AFTER resuming/unpausing the VM.
So it is fine until I resume it on the new host.

Stefan




--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK








Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-11 Thread Stefan Priebe - Profihost AG

On 11.02.2014 14:32, Orit Wasserman wrote:
 On 02/08/2014 09:23 PM, Stefan Priebe wrote:
 I could fix it by explicitly disabling xbzrle - it seems it's
 automatically on if I do not set the migration caps to false.

 So it seems to be an xbzrle bug.

 
 XBZRLE is disabled by default (actually all capabilities are off by
 default).
 What version of QEMU are you using that you need to disable it explicitly?
 Maybe you ran a migration with XBZRLE and canceled it, so it stays on?

No real idea why this happens - but yes this seems to be a problem for me.

But the bug in XBZRLE is still there ;-)

Stefan

 Orit
 
 Stefan

 On 07.02.2014 21:10, Stefan Priebe wrote:
 On 07.02.2014 21:02, Dr. David Alan Gilbert wrote:
 * Stefan Priebe (s.pri...@profihost.ag) wrote:
 Anything I could try or debug to help find the problem?

 I think the most useful thing would be to see if the problem is
 new in the 1.7 you're using or has existed
 for a while; depending on the machine type you used, it might
 be possible to load that image on an earlier (or newer) qemu
 and try the same test. However, if the problem doesn't
 repeat reliably, that can be hard.

 I first saw this with QEMU 1.5 but was not able to reproduce it for
 months. 1.4 was working fine.

 If you have any way of simplifying the configuration of the
 VM it would be good; e.g. if you could get a failure on
 something without graphics (-nographic) and USB.

 Sadly not ;-(

 Dave


 Stefan

 On 07.02.2014 14:45, Stefan Priebe - Profihost AG wrote:
 It's always the same pattern: there are too many 0s where the expected values should be.

 only seen:

 read:0x ... expected:0x

 or

 read:0x ... expected:0x

 or

 read:0xbf00bf00 ... expected:0xbfffbfff

 or

 read:0x ... expected:0xb5b5b5b5b5b5b5b5

 no idea if this helps.

 Stefan

 On 07.02.2014 14:39, Stefan Priebe - Profihost AG wrote:
 Hi,
 On 07.02.2014 14:19, Paolo Bonzini wrote:
 On 07/02/2014 14:04, Stefan Priebe - Profihost AG wrote:
 first of all i've now a memory image of a VM where i can
 reproduce it.

 You mean you start that VM with -incoming 'exec:cat
 /path/to/vm.img'?
 But the google stress test doesn't report any error until you start
 the migration _and_ it finishes?

 Sorry, no, I meant I have a VM where I saved the memory to disk - so I
 don't need to wait hours until I can reproduce it, as it does not happen
 with a freshly started VM. So it's a state file, I think.

 Another test:

 - start the VM with -S, migrate, do errors appear on the
 destination?

 I started with -S and the errors appear AFTER resuming/unpausing
 the VM.
 So it is fine until I resume it on the new host.

 Stefan


 -- 
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


 



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-11 Thread Orit Wasserman

On 02/11/2014 03:33 PM, Stefan Priebe - Profihost AG wrote:


On 11.02.2014 14:32, Orit Wasserman wrote:

On 02/08/2014 09:23 PM, Stefan Priebe wrote:

I could fix it by explicitly disabling xbzrle - it seems it's
automatically on if I do not set the migration caps to false.

So it seems to be an xbzrle bug.



XBZRLE is disabled by default (actually all capabilities are off by
default).
What version of QEMU are you using that you need to disable it explicitly?
Maybe you ran a migration with XBZRLE and canceled it, so it stays on?


No real idea why this happens - but yes this seems to be a problem for me.



I checked upstream QEMU and it is still off by default (it always has been)


But the bug in XBZRLE is still there ;-)



We need to understand the exact scenario in order to understand the problem.

What exact version of QEMU are you using?
Can you try with the latest upstream version? There were some fixes to the
XBZRLE code.
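
(A minimal sketch of building an upstream tree for such a test, using the
standard QEMU build steps from a checked-out source tree:

  ./configure --target-list=x86_64-softmmu
  make -j8

then rerun the migration test with the freshly built binary.)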


Stefan


Orit


Stefan

On 07.02.2014 21:10, Stefan Priebe wrote:

On 07.02.2014 21:02, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

Anything I could try or debug to help find the problem?


I think the most useful thing would be to see if the problem is
new in the 1.7 you're using or has existed
for a while; depending on the machine type you used, it might
be possible to load that image on an earlier (or newer) qemu
and try the same test. However, if the problem doesn't
repeat reliably, that can be hard.


I first saw this with QEMU 1.5 but was not able to reproduce it for
months. 1.4 was working fine.


If you have any way of simplifying the configuration of the
VM it would be good; e.g. if you could get a failure on
something without graphics (-nographic) and USB.


Sadly not ;-(


Dave



Stefan

On 07.02.2014 14:45, Stefan Priebe - Profihost AG wrote:

It's always the same pattern: there are too many 0s where the expected values should be.

only seen:

read:0x ... expected:0x

or

read:0x ... expected:0x

or

read:0xbf00bf00 ... expected:0xbfffbfff

or

read:0x ... expected:0xb5b5b5b5b5b5b5b5

no idea if this helps.

Stefan

On 07.02.2014 14:39, Stefan Priebe - Profihost AG wrote:

Hi,
On 07.02.2014 14:19, Paolo Bonzini wrote:

On 07/02/2014 14:04, Stefan Priebe - Profihost AG wrote:

first of all i've now a memory image of a VM where i can
reproduce it.


You mean you start that VM with -incoming 'exec:cat
/path/to/vm.img'?
But the google stress test doesn't report any error until you start
the migration _and_ it finishes?


Sorry, no, I meant I have a VM where I saved the memory to disk - so I
don't need to wait hours until I can reproduce it, as it does not happen
with a freshly started VM. So it's a state file, I think.


Another test:

- start the VM with -S, migrate, do errors appear on the
destination?


I started with -S and the errors appear AFTER resuming/unpausing
the VM.
So it is fine until I resume it on the new host.

Stefan




--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK










Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-11 Thread Stefan Priebe - Profihost AG
On 11.02.2014 14:45, Orit Wasserman wrote:
 On 02/11/2014 03:33 PM, Stefan Priebe - Profihost AG wrote:

 On 11.02.2014 14:32, Orit Wasserman wrote:
 On 02/08/2014 09:23 PM, Stefan Priebe wrote:
 I could fix it by explicitly disabling xbzrle - it seems it's
 automatically on if I do not set the migration caps to false.

 So it seems to be an xbzrle bug.


 XBZRLE is disabled by default (actually all capabilities are off by
 default).
 What version of QEMU are you using that you need to disable it
 explicitly?
 Maybe you ran a migration with XBZRLE and canceled it, so it stays on?

 No real idea why this happens - but yes this seems to be a problem for
 me.

 
 I checked upstream QEMU and it is still off by default (it always has been)

Maybe I had it on in the past and the VM was still running from an
older migration.

 But the bug in XBZRLE is still there ;-)

 
 We need to understand the exact scenario in order to understand the
 problem.
 
 What exact version of Qemu are you using?

Qemu 1.7.0

 Can you try with the latest upstream version? There were some fixes to the
 XBZRLE code.

Sadly not - I have some custom patches (not related to xbzrle) which
won't apply to current upstream.

But I could cherry-pick the ones you have in mind - if you give me the
commit IDs.
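
(i.e., against my 1.7-based tree, something like:

  git cherry-pick <commit-id>   # placeholder; the actual IDs were never posted in-thread

for each fix you name.)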

Stefan

 Stefan

 Orit

 Stefan

 On 07.02.2014 21:10, Stefan Priebe wrote:
 On 07.02.2014 21:02, Dr. David Alan Gilbert wrote:
 * Stefan Priebe (s.pri...@profihost.ag) wrote:
 Anything I could try or debug to help find the problem?

 I think the most useful thing would be to see if the problem is
 new in the 1.7 you're using or has existed
 for a while; depending on the machine type you used, it might
 be possible to load that image on an earlier (or newer) qemu
 and try the same test. However, if the problem doesn't
 repeat reliably, that can be hard.

 I first saw this with QEMU 1.5 but was not able to reproduce it
 for
 months. 1.4 was working fine.

 If you have any way of simplifying the configuration of the
 VM it would be good; e.g. if you could get a failure on
 something without graphics (-nographic) and USB.

 Sadly not ;-(

 Dave


 Stefan

 On 07.02.2014 14:45, Stefan Priebe - Profihost AG wrote:
 It's always the same pattern: there are too many 0s where the expected values should be.

 only seen:

 read:0x ... expected:0x

 or

 read:0x ... expected:0x

 or

 read:0xbf00bf00 ... expected:0xbfffbfff

 or

 read:0x ... expected:0xb5b5b5b5b5b5b5b5

 no idea if this helps.

 Stefan

 On 07.02.2014 14:39, Stefan Priebe - Profihost AG wrote:
 Hi,
 On 07.02.2014 14:19, Paolo Bonzini wrote:
 On 07/02/2014 14:04, Stefan Priebe - Profihost AG wrote:
 first of all i've now a memory image of a VM where i can
 reproduce it.

 You mean you start that VM with -incoming 'exec:cat
 /path/to/vm.img'?
 But the google stress test doesn't report any error until you start
 the migration _and_ it finishes?

 Sorry, no, I meant I have a VM where I saved the memory to disk -
 so I
 don't need to wait hours until I can reproduce it, as it does not
 happen
 with a freshly started VM. So it's a state file, I think.

 Another test:

 - start the VM with -S, migrate, do errors appear on the
 destination?

 I started with -S and the errors appear AFTER resuming/unpausing
 the VM.
 So it is fine until I resume it on the new host.

 Stefan


 -- 
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



 



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-10 Thread Dr. David Alan Gilbert
* Stefan Priebe (s.pri...@profihost.ag) wrote:
 I could fix it by explicitly disabling xbzrle - it seems it's
 automatically on if I do not set the migration caps to false.
 
 So it seems to be an xbzrle bug.

Ah right, yes that would make sense for the type of errors
you're seeing, and does make it easier to tie down.

Dave

 
 Stefan
 
 On 07.02.2014 21:10, Stefan Priebe wrote:
 On 07.02.2014 21:02, Dr. David Alan Gilbert wrote:
 * Stefan Priebe (s.pri...@profihost.ag) wrote:
 Anything I could try or debug to help find the problem?
 
 I think the most useful thing would be to see if the problem is
 new in the 1.7 you're using or has existed
 for a while; depending on the machine type you used, it might
 be possible to load that image on an earlier (or newer) qemu
 and try the same test. However, if the problem doesn't
 repeat reliably, that can be hard.
 
 I first saw this with QEMU 1.5 but was not able to reproduce it for
 months. 1.4 was working fine.
 
 If you have any way of simplifying the configuration of the
 VM it would be good; e.g. if you could get a failure on
 something without graphics (-nographic) and USB.
 
 Sadly not ;-(
 
 Dave
 
 
 Stefan
 
 On 07.02.2014 14:45, Stefan Priebe - Profihost AG wrote:
 It's always the same pattern: there are too many 0s where the expected values should be.
 
 only seen:
 
 read:0x ... expected:0x
 
 or
 
 read:0x ... expected:0x
 
 or
 
 read:0xbf00bf00 ... expected:0xbfffbfff
 
 or
 
 read:0x ... expected:0xb5b5b5b5b5b5b5b5
 
 no idea if this helps.
 
 Stefan
 
 On 07.02.2014 14:39, Stefan Priebe - Profihost AG wrote:
 Hi,
 On 07.02.2014 14:19, Paolo Bonzini wrote:
 On 07/02/2014 14:04, Stefan Priebe - Profihost AG wrote:
 first of all i've now a memory image of a VM where i can
 reproduce it.
 
 You mean you start that VM with -incoming 'exec:cat /path/to/vm.img'?
 But the google stress test doesn't report any error until you start
 the migration _and_ it finishes?
 
 Sorry, no, I meant I have a VM where I saved the memory to disk - so I
 don't need to wait hours until I can reproduce it, as it does not happen
 with a freshly started VM. So it's a state file, I think.
 
 Another test:
 
 - start the VM with -S, migrate, do errors appear on the destination?
 
 I started with -S and the errors appear AFTER resuming/unpausing the VM.
 So it is fine until I resume it on the new host.
 
 Stefan
 
 
 --
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-10 Thread Dr. David Alan Gilbert
* Stefan Priebe (s.pri...@profihost.ag) wrote:
 I could fix it by explicitly disabling xbzrle - it seems it's
 automatically on if I do not set the migration caps to false.
 
 So it seems to be an xbzrle bug.

Stefan, can you give me some more info on your hardware and
migration setup? That stressapptest (which is a really nice
find!) really batters the memory, and it means the migration
isn't converging for me, so I'm curious what your setup is.

  What CPU have you got?
  How many cores are you giving each guest?
  What network technology are you migrating over?
  Other than xbzrle what else do you have enabled?
  How long is the migrate taking for you?

Thanks,

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-10 Thread Stefan Priebe

On 10.02.2014 17:07, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

I could fix it by explicitly disabling xbzrle - it seems it's
automatically on if I do not set the migration caps to false.

So it seems to be an xbzrle bug.


Stefan, can you give me some more info on your hardware and
migration setup? That stressapptest (which is a really nice
find!) really batters the memory, and it means the migration
isn't converging for me, so I'm curious what your setup is.


That one was developed by Google and has been known to me for a few 
years. Google found that memtest and co. are not good enough to 
stress-test memory.



   What CPU have you got?


Dual Xeon E5-2695v2


   How many cores are you giving each guest?


16


   What network technology are you migrating over?


10Gb/s


   Other than xbzrle what else do you have enabled?


nothing


   How long is the migrate taking for you?


with migration_downtime = 4s, around 10s
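
(For reference: assuming pve's migration_downtime maps to QEMU's downtime
knob, the equivalent HMP command would be:

  (qemu) migrate_set_downtime 4
)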

Stefan


Thanks,

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK





Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-08 Thread Stefan Priebe
I could fix it by explicitly disabling xbzrle - it seems it's automatically 
on if I do not set the migration caps to false.


So it seems to be an xbzrle bug.

Stefan

On 07.02.2014 21:10, Stefan Priebe wrote:

On 07.02.2014 21:02, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

Anything I could try or debug to help find the problem?


I think the most useful thing would be to see if the problem is
new in the 1.7 you're using or has existed
for a while; depending on the machine type you used, it might
be possible to load that image on an earlier (or newer) qemu
and try the same test. However, if the problem doesn't
repeat reliably, that can be hard.


I first saw this with QEMU 1.5 but was not able to reproduce it for
months. 1.4 was working fine.


If you have any way of simplifying the configuration of the
VM it would be good; e.g. if you could get a failure on
something without graphics (-nographic) and USB.


Sadly not ;-(


Dave



Stefan

On 07.02.2014 14:45, Stefan Priebe - Profihost AG wrote:

It's always the same pattern: there are too many 0s where the expected values should be.

only seen:

read:0x ... expected:0x

or

read:0x ... expected:0x

or

read:0xbf00bf00 ... expected:0xbfffbfff

or

read:0x ... expected:0xb5b5b5b5b5b5b5b5

no idea if this helps.

Stefan

On 07.02.2014 14:39, Stefan Priebe - Profihost AG wrote:

Hi,
On 07.02.2014 14:19, Paolo Bonzini wrote:

On 07/02/2014 14:04, Stefan Priebe - Profihost AG wrote:

first of all i've now a memory image of a VM where i can
reproduce it.


You mean you start that VM with -incoming 'exec:cat /path/to/vm.img'?
But the google stress test doesn't report any error until you start
the migration _and_ it finishes?


Sorry, no, I meant I have a VM where I saved the memory to disk - so I
don't need to wait hours until I can reproduce it, as it does not happen
with a freshly started VM. So it's a state file, I think.


Another test:

- start the VM with -S, migrate, do errors appear on the destination?


I started with -S and the errors appear AFTER resuming/unpausing the VM.
So it is fine until I resume it on the new host.

Stefan




--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK





Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Alexandre DERUMIER

do you use xbzrle for live migration?



- Original Message -

From: Stefan Priebe s.pri...@profihost.ag
To: Dr. David Alan Gilbert dgilb...@redhat.com
Cc: Alexandre DERUMIER aderum...@odiso.com, qemu-devel 
qemu-devel@nongnu.org
Sent: Thursday, 6 February 2014 21:00:27
Subject: Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap 
file entry

Hi,
On 06.02.2014 20:51, Dr. David Alan Gilbert wrote:
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 some more things which happen during migration:

 php5.2[20258]: segfault at a0 ip 00740656 sp 7fff53b694a0
 error 4 in php-cgi[40+6d7000]

 php5.2[20249]: segfault at c ip 7f1fb8ecb2b8 sp 7fff642d9c20
 error 4 in ZendOptimizer.so[7f1fb8e71000+147000]

 cron[3154]: segfault at 7f0008a70ed4 ip 7fc890b9d440 sp
 7fff08a6f9b0 error 4 in libc-2.13.so[7fc890b67000+182000]

 OK, so let's just assume some part of memory (or CPU state, or memory
 loaded off disk...)

 You said before that it was happening on a 32GB image - is it *only*
 happening on a 32GB or bigger VM, or is it just more likely?

Not image, memory. I've only seen this with VMs having more than 16GB or
32GB of memory. But maybe this just indicates that the migration takes
longer.

 I think you also said you were using 1.7; have you tried an older
 version - i.e. is this a regression in 1.7 or don't we know?
Don't know. Sadly I cannot reproduce this with test VMs, only with
production ones.

Stefan

 Dave
 --
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG
On 07.02.2014 09:15, Alexandre DERUMIER wrote:
 
 do you use xbzrle for live migration?

No - I'm really stuck right now with this. The biggest problem is I can't
reproduce it with test machines ;-(

Stefan


 
 - Original Message -
 
 From: Stefan Priebe s.pri...@profihost.ag 
 To: Dr. David Alan Gilbert dgilb...@redhat.com 
 Cc: Alexandre DERUMIER aderum...@odiso.com, qemu-devel 
 qemu-devel@nongnu.org 
 Sent: Thursday, 6 February 2014 21:00:27 
 Subject: Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap 
 file entry 
 
 Hi, 
 On 06.02.2014 20:51, Dr. David Alan Gilbert wrote: 
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote: 
 some more things which happen during migration: 

 php5.2[20258]: segfault at a0 ip 00740656 sp 7fff53b694a0 
 error 4 in php-cgi[40+6d7000] 

 php5.2[20249]: segfault at c ip 7f1fb8ecb2b8 sp 7fff642d9c20 
 error 4 in ZendOptimizer.so[7f1fb8e71000+147000] 

 cron[3154]: segfault at 7f0008a70ed4 ip 7fc890b9d440 sp 
 7fff08a6f9b0 error 4 in libc-2.13.so[7fc890b67000+182000] 

 OK, so let's just assume some part of memory (or CPU state, or memory 
 loaded off disk...) 

 You said before that it was happening on a 32GB image - is it *only* 
 happening on a 32GB or bigger VM, or is it just more likely? 
 
 Not image, memory. I've only seen this with VMs having more than 16GB or 
 32GB of memory. But maybe this just indicates that the migration takes 
 longer. 
 
 I think you also said you were using 1.7; have you tried an older 
 version - i.e. is this a regression in 1.7 or don't we know? 
 Don't know. Sadly I cannot reproduce this with test VMs, only with 
 production ones. 
 
 Stefan 
 
 Dave 
 -- 
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK 




Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Dr. David Alan Gilbert
* Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 On 07.02.2014 09:15, Alexandre DERUMIER wrote:
  
  do you use xbzrle for live migration?
 
 No - I'm really stuck right now with this. The biggest problem is I can't
 reproduce it with test machines ;-(

Only being able to test on your production VMs isn't fun;
is it possible for you to run an extra program on these VMs - e.g.
if we came up with a simple (userland) memory test?

Dave

 
 Stefan
 
 
  
 - Original Message - 
  
  From: Stefan Priebe s.pri...@profihost.ag 
  To: Dr. David Alan Gilbert dgilb...@redhat.com 
  Cc: Alexandre DERUMIER aderum...@odiso.com, qemu-devel 
  qemu-devel@nongnu.org 
  Sent: Thursday, 6 February 2014 21:00:27 
  Subject: Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad 
  swap file entry 
  
  Hi, 
  On 06.02.2014 20:51, Dr. David Alan Gilbert wrote: 
  * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote: 
  some more things which happen during migration: 
 
  php5.2[20258]: segfault at a0 ip 00740656 sp 7fff53b694a0 
  error 4 in php-cgi[40+6d7000] 
 
  php5.2[20249]: segfault at c ip 7f1fb8ecb2b8 sp 7fff642d9c20 
  error 4 in ZendOptimizer.so[7f1fb8e71000+147000] 
 
  cron[3154]: segfault at 7f0008a70ed4 ip 7fc890b9d440 sp 
  7fff08a6f9b0 error 4 in libc-2.13.so[7fc890b67000+182000] 
 
  OK, so let's just assume some part of memory (or CPU state, or memory 
  loaded off disk...) 
 
  You said before that it was happening on a 32GB image - is it *only* 
  happening on a 32GB or bigger VM, or is it just more likely? 
  
  Not image, memory. I've only seen this with VMs having more than 16GB or 
  32GB of memory. But maybe this just indicates that the migration takes 
  longer. 
  
  I think you also said you were using 1.7; have you tried an older 
  version - i.e. is this a regression in 1.7 or don't we know? 
  Don't know. Sadly I cannot reproduce this with test VMs, only with 
  production ones. 
  
  Stefan 
  
  Dave 
  -- 
  Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK 
 
 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG

On 07.02.2014 10:15, Dr. David Alan Gilbert wrote:
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 On 07.02.2014 09:15, Alexandre DERUMIER wrote:

 do you use xbzrle for live migration?

 No - I'm really stuck right now with this. The biggest problem is I can't
 reproduce it with test machines ;-(
 
 Only being able to test on your production VMs isn't fun;
 is it possible for you to run an extra program on these VMs - e.g.
 if we came up with a simple (userland) memory test?

You mean to reproduce?

I already tried https://code.google.com/p/stressapptest/ while migrating
on a test VM but this works fine.

I also tried running mysql bench while migrating on a test VM and this
works too ;-(

Stefan

 Dave
 

 Stefan



 - Original Message - 

 From: Stefan Priebe s.pri...@profihost.ag 
 To: Dr. David Alan Gilbert dgilb...@redhat.com 
 Cc: Alexandre DERUMIER aderum...@odiso.com, qemu-devel 
 qemu-devel@nongnu.org 
 Sent: Thursday, 6 February 2014 21:00:27 
 Subject: Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad 
 swap file entry 

 Hi, 
 On 06.02.2014 20:51, Dr. David Alan Gilbert wrote: 
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote: 
 some more things which happen during migration: 

 php5.2[20258]: segfault at a0 ip 00740656 sp 7fff53b694a0 
 error 4 in php-cgi[40+6d7000] 

 php5.2[20249]: segfault at c ip 7f1fb8ecb2b8 sp 7fff642d9c20 
 error 4 in ZendOptimizer.so[7f1fb8e71000+147000] 

 cron[3154]: segfault at 7f0008a70ed4 ip 7fc890b9d440 sp 
 7fff08a6f9b0 error 4 in libc-2.13.so[7fc890b67000+182000] 

 OK, so let's just assume some part of memory (or CPU state, or memory 
 loaded off disk...) 

 You said before that it was happening on a 32GB image - is it *only* 
 happening on a 32GB or bigger VM, or is it just more likely? 

 Not image, memory. I've only seen this with VMs having more than 16GB or 
 32GB of memory. But maybe this just indicates that the migration takes 
 longer. 

 I think you also said you were using 1.7; have you tried an older 
 version - i.e. is this a regression in 1.7 or don't we know? 
 Don't know. Sadly I cannot reproduce this with test VMs, only with 
 production ones. 

 Stefan 

 Dave 
 -- 
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK 


 --
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
 



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Marcin Gibuła

do you use xbzrle for live migration?


No - I'm really stuck right now with this. The biggest problem is I can't
reproduce it with test machines ;-(


Only being able to test on your production VMs isn't fun;
is it possible for you to run an extra program on these VMs - e.g.
if we came up with a simple (userland) memory test?


You mean to reproduce?

I already tried https://code.google.com/p/stressapptest/ while migrating
on a test VM but this works fine.

I also tried running mysql bench while migrating on a test VM and this
works too ;-(


Have you tried letting the test VM run idle for some time before migrating?
(like 18-24 hours)


Having the same (or a very similar) problem, I had better luck
reproducing it by not using freshly started VMs.


--
mg



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG
Hi,
On 07.02.2014 10:29, Marcin Gibuła wrote:
 do you use xbzrle for live migration?

 No - I'm really stuck right now with this. The biggest problem is I can't
 reproduce it with test machines ;-(

 Only being able to test on your production VMs isn't fun;
 is it possible for you to run an extra program on these VMs - e.g.
 if we came up with a simple (userland) memory test?

 You mean to reproduce?

 I already tried https://code.google.com/p/stressapptest/ while migrating
 on a test VM but this works fine.

 I also tried running mysql bench while migrating on a test VM and this
 works too ;-(
 
 Have you tried letting the test VM run idle for some time before migrating?
 (like 18-24 hours)
 
 Having the same (or a very similar) problem, I had better luck
 reproducing it by not using freshly started VMs.

No, I haven't tried this; will do so soon.

Stefan



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Dr. David Alan Gilbert
* Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 
 On 07.02.2014 10:15, Dr. David Alan Gilbert wrote:
  * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
  On 07.02.2014 09:15, Alexandre DERUMIER wrote:
 
  do you use xbzrle for live migration?
 
  No - I'm really stuck right now with this. The biggest problem is I can't
  reproduce it with test machines ;-(
 
 Only being able to test on your production VMs isn't fun;
 is it possible for you to run an extra program on these VMs - e.g.
 if we came up with a simple (userland) memory test?
 
 You mean to reproduce?

I'm more interested in seeing what type of corruption is happening;
if you've got a test VM that corrupts memory and we can run a program
in that VM that writes a known pattern into memory and checks it,
then see what changed after migration, it might give a clue.

But obviously this would only be of any use if run on the VM that actually
fails.
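
A minimal sketch of that kind of checker (my sketch, not a tested tool:
fill a large buffer with a fixed pattern, then re-verify it in a loop
while the migration runs, printing any mismatch):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define PAT 0xb5b5b5b5b5b5b5b5ULL   /* known fill pattern */

int main(void)
{
    /* 1 GiB of pattern; the size is arbitrary for illustration */
    size_t words = (1024UL * 1024 * 1024) / sizeof(uint64_t);
    uint64_t *buf = malloc(words * sizeof(uint64_t));
    if (!buf) {
        return 1;
    }
    for (size_t i = 0; i < words; i++) {
        buf[i] = PAT;                /* write the known pattern */
    }
    for (;;) {                       /* migrate the VM while this runs */
        for (size_t i = 0; i < words; i++) {
            if (buf[i] != PAT) {
                printf("mismatch at %p: read 0x%016llx expected 0x%016llx\n",
                       (void *)&buf[i], (unsigned long long)buf[i],
                       (unsigned long long)PAT);
                buf[i] = PAT;        /* re-arm so later corruption is caught */
            }
        }
    }
}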

 I already tried https://code.google.com/p/stressapptest/ while migrating
 on a test VM but this works fine.
 
 I also tried running mysql bench while migrating on a test VM and this
 works too ;-(


Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG

On 07.02.2014 10:31, Dr. David Alan Gilbert wrote:
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:

 On 07.02.2014 10:15, Dr. David Alan Gilbert wrote:
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 On 07.02.2014 09:15, Alexandre DERUMIER wrote:

 do you use xbzrle for live migration?

 No - I'm really stuck right now with this. The biggest problem is I can't
 reproduce it with test machines ;-(

 Only being able to test on your production VMs isn't fun;
 is it possible for you to run an extra program on these VMs - e.g.
 if we came up with a simple (userland) memory test?

 You mean to reproduce?
 
 I'm more interested in seeing what type of corruption is happening;
 if you've got a test VM that corrupts memory and we can run a program
 in that VM that writes a known pattern into memory and checks it,
 then see what changed after migration, it might give a clue.
 
 But obviously this would only be of any use if run on the VM that actually
 fails.

Right, that makes sense - sadly I still don't know how to reproduce it. Any
app ideas I can try?


 I already tried https://code.google.com/p/stressapptest/ while migrating
 on a test VM but this works fine.

 I also tried running mysql bench while migrating on a test VM and this
 works too ;-(
 
 
 Dave
 --
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
 



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Marcin Gibuła

You mean to reproduce?


I'm more interested in seeing what type of corruption is happening;
if you've got a test VM that corrupts memory and we can run a program
in that VM that writes a known pattern into memory and checks it,
then see what changed after migration, it might give a clue.

But obviously this would only be of any use if run on the VM that actually
fails.


Hi,

Seeing a similar issue at my company, I would be happy to run such tests.
Do you have any test suite I could run, or some leads on how to write one?


--
mg



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG
Hi,

I was able to reproduce it with a longer-running test VM running the google
stress test.

And it happens exactly when the migration finishes; it does not happen
while the migration is running.

The Google stress test output displays memory errors:

Page Error: miscompare on CPU 5(0x) at 0x7f52431341c0(0x0:DIMM
Unknown): read:0x004000bf, reread:0x004000bf
expected:0x0040ffbf
Report Error: miscompare : DIMM Unknown : 1 : 571s
Page Error: miscompare on CPU 5(0x) at 0x7f52431341c8(0x0:DIMM
Unknown): read:0x002000df, reread:0x002000df
expected:0x0020ffdf
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34020(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34028(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34060(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34068(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340a0(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340a8(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340e0(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340e8(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34120(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34128(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34160(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34168(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341a0(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341a8(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341e0(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341e8(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34220(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34228(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34260(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34268(0x0:DIMM
Unknown): read:0x, reread:0x
expected:0x
Report Error: miscompare : DIMM Unknown : 1 : 571s
Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c342a0(0x0:DIMM

Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Dr. David Alan Gilbert
* Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 Hi,
 
 I was able to reproduce it with a longer-running test VM running the google
 stress test.

Hmm that's quite a fun set of differences; I think I'd like
to understand whether the pattern is related to the pattern of what
the test is doing.

Can you just give an explanation of exactly how you ran that test?
   What you installed, how exactly you ran it.

Then Marcin and I can try and replicate it.

Dave

 And it happens exactly when the migration finishes; it does not happen
 while the migration is running.
 
 Google Stress Output displays Memory errors:
 
 Page Error: miscompare on CPU 5(0x) at 0x7f52431341c0(0x0:DIMM
 Unknown): read:0x004000bf, reread:0x004000bf
 expected:0x0040ffbf
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Page Error: miscompare on CPU 5(0x) at 0x7f52431341c8(0x0:DIMM
 Unknown): read:0x002000df, reread:0x002000df
 expected:0x0020ffdf
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34020(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34028(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34060(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34068(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340a0(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340a8(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340e0(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340e8(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34120(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34128(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34160(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34168(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341a0(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341a8(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341e0(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341e8(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34220(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34228(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on 

Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG
Hi,
On 07.02.2014 13:21, Dr. David Alan Gilbert wrote:
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 Hi,

 I was able to reproduce it with a longer-running test VM running the google
 stress test.
 
 Hmm that's quite a fun set of differences; I think I'd like
 to understand whether the pattern is related to the pattern of what
 the test is doing.
 
 Can you just give an explanation of exactly how you ran that test?
What you installed, how exactly you ran it.

While migrating I've still no reliable way to reproduce it, but I'll try to.

I can force the problem without migration when starting with:
bin/stressapptest -s 3600 -m 20 -i 20 -C 20 --force_errors

(--force_errors = injects false errors to test error handling)
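
(Flag meanings, as I read stressapptest's help output - worth
double-checking against your build:
  -s 3600   run for 3600 seconds
  -m 20     20 memory copy threads
  -i 20     20 memory invert threads
  -C 20     20 CPU-stress threads)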

Stefan

 Then Marcin and I can try and replicate it.
 
 Dave
 
 And it happens exactly when the migration finishes; it does not happen
 while the migration is running.

 Google Stress Output displays Memory errors:
 
 Page Error: miscompare on CPU 5(0x) at 0x7f52431341c0(0x0:DIMM
 Unknown): read:0x004000bf, reread:0x004000bf
 expected:0x0040ffbf
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Page Error: miscompare on CPU 5(0x) at 0x7f52431341c8(0x0:DIMM
 Unknown): read:0x002000df, reread:0x002000df
 expected:0x0020ffdf
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34020(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34028(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34060(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34068(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340a0(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340a8(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340e0(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c340e8(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34120(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34128(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34160(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34168(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341a0(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341a8(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341e0(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c341e8(0x0:DIMM
 Unknown): read:0x, reread:0x
 expected:0x
 Report Error: miscompare : DIMM Unknown : 1 : 571s
 Hardware Error: miscompare on CPU 3(0x) at 0x7f4fd5c34220(0x0:DIMM
 Unknown): read:0x, reread:0x
 

Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Paolo Bonzini

On 07/02/2014 13:30, Stefan Priebe - Profihost AG wrote:

 I was able to reproduce it with a longer-running test VM running the google
 stress test.

Hmm that's quite a fun set of differences; I think I'd like
to understand whether the pattern is related to the pattern of what
the test is doing.


Stefan, can you try to reproduce it:

- with Unix migration between two QEMUs on the same host (see the sketch after this list)

- with different hosts

- with a different network (e.g. just a cross cable between two machines)
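
A minimal sketch of the same-host Unix-socket case (the socket path is
arbitrary): start the destination with the same configuration plus
-incoming unix:/tmp/mig.sock, then on the source monitor:

  (qemu) migrate unix:/tmp/mig.sock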

Paolo



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG
Hi,

On 07.02.2014 13:44, Paolo Bonzini wrote:
 On 07/02/2014 13:30, Stefan Priebe - Profihost AG wrote:
  I was able to reproduce it with a longer-running test VM running the
 google
  stress test.

 Hmm that's quite a fun set of differences; I think I'd like
 to understand whether the pattern is related to the pattern of what
 the test is doing.
 
 Stefan, can you try to reproduce it:

First of all, I now have a memory image of a VM where I can reproduce it.
Reproducing does NOT work if I boot the VM freshly; I need to let it run
for some hours.

Then, just when the migration finishes, there is a short time frame where
the google stress app reports memory errors; once the migration has
finished, it runs fine again.

It seems to me it is related to pause and unpause/resume?

 - with Unix migration between two QEMUs on the same host
now tested = same issue

 - with different hosts
already tested = same issue

 - with a different network (e.g. just a cross cable between two machines)
already tested = same issue

Greets,
Stefan



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Dr. David Alan Gilbert
* Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 Hi,
 
  On 07.02.2014 13:44, Paolo Bonzini wrote:
   On 07/02/2014 13:30, Stefan Priebe - Profihost AG wrote:
    I was able to reproduce it with a longer-running test VM running the
   google
    stress test.
 
  Hmm that's quite a fun set of differences; I think I'd like
  to understand whether the pattern is related to the pattern of what
  the test is doing.
  
  Stefan, can you try to reproduce it:
 
 First of all, I now have a memory image of a VM where I can reproduce it.
 Reproducing does NOT work if I boot the VM freshly; I need to let it run
 for some hours.
 
 Then, just when the migration finishes, there is a short time frame where
 the google stress app reports memory errors; once the migration has
 finished, it runs fine again.
 
 It seems to me it is related to pause and unpause/resume?

But do you have to pause/resume it to cause the error? Have you got cases
where you boot it and then leave it running for a few hours and then it 
fails if you migrate it?

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU Live Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG
Hi,

On 07.02.2014 14:08, Dr. David Alan Gilbert wrote:
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 Hi,

 On 07.02.2014 13:44, Paolo Bonzini wrote:
 On 07/02/2014 13:30, Stefan Priebe - Profihost AG wrote:
 I was able to reproduce it with a longer-running test VM running the
 google
 stress test.

 Hmm that's quite a fun set of differences; I think I'd like
 to understand whether the pattern is related to the pattern of what
 the test is doing.

 Stefan, can you try to reproduce it:

 First of all, I now have a memory image of a VM where I can reproduce it.
 Reproducing does NOT work if I boot the VM freshly; I need to let it run
 for some hours.

 Then, just when the migration finishes, there is a short time frame where
 the google stress app reports memory errors; once the migration has
 finished, it runs fine again.

 It seems to me it is related to pause and unpause/resume?
 
 But do you have to pause/resume it to cause the error? Have you got cases
 where you boot it and then leave it running for a few hours and then it 
 fails if you migrate it?

Yes, but isn't migration always a pause/unpause at the end? I thought
migrate_downtime is the maximum length of pause/unpause that is allowed.

Stefan




Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-07 Thread Dr. David Alan Gilbert
* Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 Hi,
 
 On 07.02.2014 14:08, Dr. David Alan Gilbert wrote:
  * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:

  First of all, I now have a memory image of a VM with which I can
  reproduce it. Reproducing does NOT work if I boot the VM freshly; I
  need to let it run for some hours first.
 
  Just as the migration finishes there is a short time frame in which
  the Google stress app reports memory errors; once the migration has
  completed, it runs fine again.
 
  It seems to me that it is related to the pause and unpause/resume?
  
  But do you have to pause/resume it to cause the error? Have you got cases
  where you boot it and then leave it running for a few hours and then it
  fails if you migrate it?
 
 Yes, but isn't migration always a pause/unpause at the end? I thought
 migrate_downtime is the maximum length of pause/unpause that is allowed.

There's a heck of a lot of other stuff that goes on in migration, and that
downtime isn't quite the same.

If it can be reproduced with just suspend/resume stuff then that's a different
place to start looking than if it's migration only.
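
(For reference, if you're driving this by hand, the downtime knob and
the migration state are both reachable from the HMP monitor; a minimal
sketch, with a made-up destination host/port:

    (qemu) migrate_set_downtime 4         # allow up to ~4s of pause at switchover
    (qemu) migrate -d tcp:desthost:60000  # start the migration in the background
    (qemu) info migrate                   # status, RAM transferred/remaining, downtime

The suspend/resume-only comparison would just be "stop" followed by
"cont" on the same monitor, with no migration at all.)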

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG
Hi,

On 07.02.2014 14:15, Dr. David Alan Gilbert wrote:
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 Hi,

 On 07.02.2014 14:08, Dr. David Alan Gilbert wrote:
 * Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 
 First of all, I now have a memory image of a VM with which I can
 reproduce it. Reproducing does NOT work if I boot the VM freshly; I
 need to let it run for some hours first.

 Just as the migration finishes there is a short time frame in which
 the Google stress app reports memory errors; once the migration has
 completed, it runs fine again.

 It seems to me that it is related to the pause and unpause/resume?

 But do you have to pause/resume it to cause the error? Have you got cases
 where you boot it and then leave it running for a few hours and then it
 fails if you migrate it?

 Yes, but isn't migration always a pause/unpause at the end? I thought
 migrate_downtime is the maximum length of pause/unpause that is allowed.
 
 There's a heck of a lot of other stuff that goes on in migration, and that
 downtime isn't quite the same.
 
 If it can be reproduced with just suspend/resume stuff then that's a different
 place to start looking than if it's migration only.

Ah OK, now I get it. No, I can't reproduce it with suspend/resume. But
while migrating it happens right at the end, when the switch from host A
to host B happens.

 Dave
 --
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
 



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-07 Thread Paolo Bonzini

On 07/02/2014 14:04, Stefan Priebe - Profihost AG wrote:

First of all, I now have a memory image of a VM with which I can reproduce it.


You mean you start that VM with -incoming 'exec:cat /path/to/vm.img'?
But the Google stress test doesn't report any error until you start the
migration _and_ it finishes?


That sounds good enough.  Can you upload the image somewhere (doesn't 
have to be a public place, you can contact David or others offlist)?



Reproducing does NOT work if I boot the VM freshly; I need to let it run
for some hours first.

Just as the migration finishes there is a short time frame in which the
Google stress app reports memory errors; once the migration has
completed, it runs fine again.

It seems to me that it is related to the pause and unpause/resume?


 - with Unix migration between two QEMUs on the same host

now tested => same issue


 - with different hosts

already tested => same issue


 - with a different network (e.g. just a crossover cable between two machines)

already tested => same issue


Another test:

- start the VM with -S, migrate, do errors appear on the destination?
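
(Something like this, assuming a TCP transport and a made-up port:

    # source, started paused so the guest never runs there:
    qemu-system-x86_64 ... -S
    (qemu) migrate tcp:desthost:60000
    # destination:
    qemu-system-x86_64 ... -incoming tcp:0:60000
    # once migration completes, resume there and watch the stress test:
    (qemu) cont

The point is to see whether a guest that never ran on the source still
shows the errors once it is resumed on the destination.)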

Thanks,

Paolo



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG
Hi,
On 07.02.2014 14:19, Paolo Bonzini wrote:
 On 07/02/2014 14:04, Stefan Priebe - Profihost AG wrote:
 First of all, I now have a memory image of a VM with which I can reproduce it.
 
 You mean you start that VM with -incoming 'exec:cat /path/to/vm.img'?
 But the Google stress test doesn't report any error until you start the
 migration _and_ it finishes?

Sorry, no: I meant that I have a VM whose memory I saved to disk, so I
don't need to wait hours until I can reproduce it (it does not happen
with a freshly started VM). So it's a state file, I think.
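
(One way to capture and replay such a state - a sketch only, not
necessarily exactly what I did, paths made up - is via the exec
migration transport:

    (qemu) stop
    (qemu) migrate "exec:cat > /tmp/vm-state.img"

and later

    qemu-system-x86_64 ... -incoming "exec:cat /tmp/vm-state.img"

so the hours-long warm-up only has to happen once.)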

 Another test:
 
 - start the VM with -S, migrate, do errors appear on the destination?

I started with -S and the errors appear AFTER resuming/unpausing the VM.
So it is fine until I resume it on the new host.

Stefan



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe - Profihost AG
It's always the same pattern: there are too many 0s where the expected
values should be.

only seen:

read:0x ... expected:0x

or

read:0x ... expected:0x

or

read:0xbf00bf00 ... expected:0xbfffbfff

or

read:0x ... expected:0xb5b5b5b5b5b5b5b5

No idea if this helps.
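
(The shape of those errors - stale zeros where fresh data was expected -
is what an XBZRLE-style delta would produce if it were applied on top of
a cached copy that no longer matches the sender's. A minimal,
self-contained C sketch of that failure mode; illustrative only, this is
not QEMU's actual XBZRLE encoding:

    #include <stdio.h>
    #include <string.h>

    #define WORDS 8

    /* Encode: record (index, value) for every word that changed
     * relative to the sender's cached copy of the page. */
    static int delta_encode(const unsigned old[], const unsigned cur[],
                            unsigned idx[], unsigned val[])
    {
        int n = 0;
        for (int i = 0; i < WORDS; i++) {
            if (old[i] != cur[i]) {
                idx[n] = i;
                val[n] = cur[i];
                n++;
            }
        }
        return n;
    }

    /* Decode: patch the receiver's cached copy in place. */
    static void delta_decode(unsigned page[], const unsigned idx[],
                             const unsigned val[], int n)
    {
        for (int i = 0; i < n; i++) {
            page[idx[i]] = val[i];
        }
    }

    int main(void)
    {
        unsigned src_cache[WORDS], dst_cache[WORDS], cur[WORDS];
        unsigned idx[WORDS], val[WORDS];

        for (int i = 0; i < WORDS; i++) {        /* both caches in sync */
            src_cache[i] = dst_cache[i] = cur[i] = 0xbfffbfff;
        }
        cur[3] = 0xdeadbeef;                     /* guest dirties one word */
        memset(dst_cache, 0, sizeof(dst_cache)); /* receiver cache goes stale */

        int n = delta_encode(src_cache, cur, idx, val);
        delta_decode(dst_cache, idx, val, n);

        /* Only word 3 was patched; everything else keeps stale zeros. */
        printf("read:0x%08x ... expected:0x%08x\n", dst_cache[0], cur[0]);
        return 0;
    }

Run, that prints read:0x00000000 ... expected:0xbfffbfff - the same kind
of mismatch as above.)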

Stefan

On 07.02.2014 14:39, Stefan Priebe - Profihost AG wrote:
 Hi,
 On 07.02.2014 14:19, Paolo Bonzini wrote:
 On 07/02/2014 14:04, Stefan Priebe - Profihost AG wrote:
 First of all, I now have a memory image of a VM with which I can reproduce it.

 You mean you start that VM with -incoming 'exec:cat /path/to/vm.img'?
 But the Google stress test doesn't report any error until you start the
 migration _and_ it finishes?
 
 Sorry, no: I meant that I have a VM whose memory I saved to disk, so I
 don't need to wait hours until I can reproduce it (it does not happen
 with a freshly started VM). So it's a state file, I think.
 
 Another test:

 - start the VM with -S, migrate, do errors appear on the destination?
 
 I started with -S and the errors appear AFTER resuming/unpausing the VM.
 So it is fine until I resume it on the new host.
 
 Stefan
 



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-07 Thread Dr. David Alan Gilbert
* Stefan Priebe (s.pri...@profihost.ag) wrote:
 Anything I could try or debug, to help find the problem?

I think the most useful thing would be to see whether the problem is
new in the 1.7 you're using or has existed for a while; depending on
the machine type you used, it might be possible to load that image on
an earlier (or newer) qemu and try the same test. However, if the
problem doesn't repeat reliably, that can be hard.

If you have any way of simplifying the configuration of the VM, that
would be good; e.g. if you could get a failure on something without
graphics (-nographic) and without USB.

Dave

 
 Stefan
 
 On 07.02.2014 14:45, Stefan Priebe - Profihost AG wrote:
 [...]
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe

On 07.02.2014 21:02, Dr. David Alan Gilbert wrote:

* Stefan Priebe (s.pri...@profihost.ag) wrote:

Anything I could try or debug, to help find the problem?


I think the most useful thing would be to see whether the problem is
new in the 1.7 you're using or has existed for a while; depending on
the machine type you used, it might be possible to load that image on
an earlier (or newer) qemu and try the same test. However, if the
problem doesn't repeat reliably, that can be hard.


I first saw this with QEMU 1.5 but was not able to reproduce it for
months. 1.4 was working fine.



If you have any way of simplifying the configuration of the VM, that
would be good; e.g. if you could get a failure on something without
graphics (-nographic) and without USB.


Sadly not ;-(


Dave



Stefan

On 07.02.2014 14:45, Stefan Priebe - Profihost AG wrote:
[...]


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK





Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-07 Thread Stefan Priebe

Anything I could try or debug, to help find the problem?

Stefan

On 07.02.2014 14:45, Stefan Priebe - Profihost AG wrote:
[...]





Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-06 Thread Alexandre DERUMIER
Do you force rbd_cache=true in ceph.conf?

If yes, do you use cache=writeback?

According to the Ceph docs:
http://ceph.com/docs/next/rbd/qemu-rbd/

Important: If you set rbd_cache=true, you must set cache=writeback or risk data
loss. Without cache=writeback, QEMU will not send flush requests to librbd. If
QEMU exits uncleanly in this configuration, filesystems on top of rbd can be
corrupted.
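
(For completeness, the matching pair of settings would look roughly like
this - pool and image names invented:

    # ceph.conf, client side:
    [client]
        rbd cache = true

    # QEMU drive option, which must then use writeback:
    -drive file=rbd:rbd/vm-disk-1:conf=/etc/ceph/ceph.conf,if=virtio,cache=writeback
)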



- Original message -

From: Stefan Priebe s.pri...@profihost.ag
To: pve-de...@pve.proxmox.com, qemu-devel qemu-devel@nongnu.org
Sent: Wednesday, 5 February 2014 18:51:15
Subject: [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

Hello,

after live-migrating machines with a lot of memory (32GB, 48GB, ...) I
quite often see services crashing after the migration, and the guest
kernel prints:

[1707620.031806] swap_free: Bad swap file entry 00377410
[1707620.031806] swap_free: Bad swap file entry 00593c48
[1707620.031807] swap_free: Bad swap file entry 03201430
[1707620.031807] swap_free: Bad swap file entry 01bc5900
[1707620.031807] swap_free: Bad swap file entry 0173ce40
[1707620.031808] swap_free: Bad swap file entry 011c0270
[1707620.031808] swap_free: Bad swap file entry 03c58ae8
[1707660.749059] BUG: Bad rss-counter state mm:88064d09f380 idx:1
val:1536
[1707660.749937] BUG: Bad rss-counter state mm:88064d09f380 idx:2
val:-1536

Qemu is 1.7

Does anybody know a fix?

Greets,
Stefan
___
pve-devel mailing list
pve-de...@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-06 Thread Stefan Priebe - Profihost AG

On 06.02.2014 12:14, Alexandre DERUMIER wrote:
 Do you force rbd_cache=true in ceph.conf?

no

 If yes, do you use cache=writeback?

yes

So this should be safe.

PS: all my guests do not even have !!SWAP!!

# free|grep Swap
Swap:0  0  0

Stefan

 According to the Ceph docs:
 http://ceph.com/docs/next/rbd/qemu-rbd/
 [...]



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-06 Thread Alexandre DERUMIER
 PS: all my guests do not even have !!SWAP!!

Not sure it is related to the swap file.

I found a similar problem here, triggered by suspend/resume on ext4:

http://lkml.indiana.edu/hypermail/linux/kernel/1106.3/01340.html


Maybe it is a guest kernel bug?

- Original message -
[...]




Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-06 Thread Stefan Priebe - Profihost AG
Maybe; sadly I've no idea. Only using a 3.10 kernel with XFS.

Stefan

On 06.02.2014 12:40, Alexandre DERUMIER wrote:
 PS: all my guests do not even have !!SWAP!!
 
 Not sure it is related to the swap file.
 
 I found a similar problem here, triggered by suspend/resume on ext4:
 
 http://lkml.indiana.edu/hypermail/linux/kernel/1106.3/01340.html
 
 
 Maybe it is a guest kernel bug?
 
 - Original message -
 [...]



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-06 Thread Stefan Priebe - Profihost AG
Some more things that happen during migration:

php5.2[20258]: segfault at a0 ip 00740656 sp 7fff53b694a0
error 4 in php-cgi[40+6d7000]

php5.2[20249]: segfault at c ip 7f1fb8ecb2b8 sp 7fff642d9c20
error 4 in ZendOptimizer.so[7f1fb8e71000+147000]

cron[3154]: segfault at 7f0008a70ed4 ip 7fc890b9d440 sp
7fff08a6f9b0 error 4 in libc-2.13.so[7fc890b67000+182000]

Stefan

On 06.02.2014 13:10, Stefan Priebe - Profihost AG wrote:
 Maybe; sadly I've no idea. Only using a 3.10 kernel with XFS.
 
 Stefan
 
 On 06.02.2014 12:40, Alexandre DERUMIER wrote:
 PS: all my guests do not even have !!SWAP!!

 Not sure it is related to the swap file.

 I found a similar problem here, triggered by suspend/resume on ext4:

 http://lkml.indiana.edu/hypermail/linux/kernel/1106.3/01340.html


 Maybe it is a guest kernel bug?

 - Original message -
 [...]




Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-06 Thread Marcin Gibuła

On 06.02.2014 15:03, Stefan Priebe - Profihost AG wrote:

Some more things that happen during migration:

php5.2[20258]: segfault at a0 ip 00740656 sp 7fff53b694a0
error 4 in php-cgi[40+6d7000]

php5.2[20249]: segfault at c ip 7f1fb8ecb2b8 sp 7fff642d9c20
error 4 in ZendOptimizer.so[7f1fb8e71000+147000]

cron[3154]: segfault at 7f0008a70ed4 ip 7fc890b9d440 sp
7fff08a6f9b0 error 4 in libc-2.13.so[7fc890b67000+182000]


Hi,

I've seen memory corruption after live (and offline) migrations as
well. But in our environment it mostly (but not only) shows up as timer
corruption - guests hang or end up with an insane date in the future.
I've seen segfaults and oopses as well.


Sadly it's very hard for me to reproduce reliably, but it occurs on
all types of Linux guests - all versions of Ubuntu, CentOS, Debian,
etc. - so it doesn't seem to be tied to a specific guest kernel
version. I've never seen Windows crashing, though. There was another
guy here on qemu-devel who had a similar issue and fixed it by running
the guest with no-kvmclock.
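
(That is a guest kernel command line switch; e.g., assuming a grub2
guest - illustrative file path:

    # /etc/default/grub, then regenerate grub.cfg and reboot:
    GRUB_CMDLINE_LINUX="... no-kvmclock"

which makes the guest fall back to another clocksource such as tsc or
acpi_pm.)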


I've tested qemu 1.4 - 1.6 and kernels 3.4 - 3.10.

--
mg



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-06 Thread Dr. David Alan Gilbert
* Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:
 some more things which happen during migration:
 
 php5.2[20258]: segfault at a0 ip 00740656 sp 7fff53b694a0
 error 4 in php-cgi[40+6d7000]
 
 php5.2[20249]: segfault at c ip 7f1fb8ecb2b8 sp 7fff642d9c20
 error 4 in ZendOptimizer.so[7f1fb8e71000+147000]
 
 cron[3154]: segfault at 7f0008a70ed4 ip 7fc890b9d440 sp
 7fff08a6f9b0 error 4 in libc-2.13.so[7fc890b67000+182000]

OK, so let's just assume some part of memory (or CPU state, or memory
loaded off disk...) is being corrupted.

You said before that it was happening on a 32GB image - is it *only*
happening on a 32GB or bigger VM, or is it just more likely?

I think you also said you were using 1.7; have you tried an older
version - i.e. is this a regression in 1.7 or don't we know?

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [pve-devel] QEMU LIve Migration - swap_free: Bad swap file entry

2014-02-06 Thread Stefan Priebe

Hi,
On 06.02.2014 20:51, Dr. David Alan Gilbert wrote:

* Stefan Priebe - Profihost AG (s.pri...@profihost.ag) wrote:

Some more things that happen during migration:

php5.2[20258]: segfault at a0 ip 00740656 sp 7fff53b694a0
error 4 in php-cgi[40+6d7000]

php5.2[20249]: segfault at c ip 7f1fb8ecb2b8 sp 7fff642d9c20
error 4 in ZendOptimizer.so[7f1fb8e71000+147000]

cron[3154]: segfault at 7f0008a70ed4 ip 7fc890b9d440 sp
7fff08a6f9b0 error 4 in libc-2.13.so[7fc890b67000+182000]


OK, so let's just assume some part of memory (or CPU state, or memory
loaded off disk...) is being corrupted.

You said before that it was happening on a 32GB image - is it *only*
happening on a 32GB or bigger VM, or is it just more likely?


Not image, memory. I've only seen this with VMs having more than 16GB
or 32GB of memory. But maybe that only indicates that the migration
takes longer.



I think you also said you were using 1.7; have you tried an older
version - i.e. is this a regression in 1.7 or don't we know?

Don't know. Sadly, I cannot reproduce this with test VMs, only with
production ones.


Stefan


Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK