date:20150116

Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: selectable object size support

2015-01-16 Thread Teruaki Ishizaki


Hi Hitoshi,

Thanks for your check.

(2015/01/15 15:05), Hitoshi Mitake wrote:

At Tue, 13 Jan 2015 17:41:12 +0900,
Teruaki Ishizaki wrote:


Previously, qemu block driver of sheepdog used hard-coded VDI object size.
This patch enables users to handle block_size_shift value for
calculating VDI object size.

When you start qemu, you don't need to specify additional command option.

But when you create the VDI which doesn't have default object size
with qemu-img command, you specify block_size_shift option.

If you want to create a VDI of 8MB(1  23) object size,
you need to specify following command option.

  # qemu-img create -o block_size_shift=23 sheepdog:test1 100M

In addition, when you don't specify qemu-img command option,
a default value of sheepdog cluster is used for creating VDI.

  # qemu-img create sheepdog:test2 100M

Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp
---
  block/sheepdog.c  |  138 +---
  include/block/block_int.h |1 +
  2 files changed, 117 insertions(+), 22 deletions(-)





@@ -1610,7 +1633,8 @@ out_with_err_set:
  if (bs) {
  bdrv_unref(bs);
  }
-g_free(buf);
+if (buf)


The above line has a style problem (white space).


I'll change it.




@@ -1669,6 +1693,17 @@ static int parse_redundancy(BDRVSheepdogState *s, const 
char *opt)
  return 0;
  }

+static int parse_block_size_shift(BDRVSheepdogState *s, const char *opt)
+{
+struct SheepdogInode *inode = s-inode;
+inode-block_size_shift = (uint8_t)atoi(opt);


This patch cannot create VDI with specified block size shift. Below is
the reason:

Initializing block_size_shift of the inode object here (in
parse_block_size_shift()) doesn't make sense. Because the
block_size_shift of newly created VDI is specified by block_size_shift
of SheepdogVdiReq (in sheepdog source code, sd_req.vdi). You need to
synchronize the struct and initialize the parameter. Below is a slack
solution:

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 525254e..3dc3359 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -168,7 +168,8 @@ typedef struct SheepdogVdiReq {
  uint32_t base_vdi_id;
  uint8_t copies;
  uint8_t copy_policy;
-uint8_t reserved[2];
+uint8_t store_policy;
+uint8_t block_size_shift;
  uint32_t snapid;
  uint32_t type;
  uint32_t pad[2];
@@ -1560,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s,
uint32_t *vdi_id, int snapshot,
  hdr.vdi_size = s-inode.vdi_size;
  hdr.copy_policy = s-inode.copy_policy;
  hdr.copies = s-inode.nr_copies;
+hdr.block_size_shift = s-inode.block_size_shift;

  ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen,
  rlen);



Oh, sorry for my miss-edit for the patch
I had tested the source code as you sugested fixing.

I'll fix the patch like that!

 # This is a very slack solution. You need to avoid using inode as a
 # temporal space of the parameter.
Originally, do_sd_create() had argument struct BDRVSheepdogState
which included inode, and parse_redundancy() substitute copy_policy
and　copies substituted for inode.

So, I created parse_block_size_shift() like parse_redundancy()
to handle block_size_shift like copy_policy and copies options.

Which point is temporal?


Thanks,
Teruaki Ishizaki

[Qemu-devel] [PATCH] qtest: Fix deadloop by running main loop AIO context's timers

2015-01-16 Thread Fam Zheng

qemu_clock_run_timers() only takes care of main_loop_tlg, we shouldn't
forget aio timer list groups.

Currently, the qemu_clock_deadline_ns_all (a few lines above) counts all
the timergroups of this clock type, including aio tlg, but we don't fire
them, so they are never cleared, which makes a dead loop.

For example, this function hangs when trying to drive throttled block
request queue with qtest clock_step.

Signed-off-by: Fam Zheng f...@redhat.com
---
 cpus.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cpus.c b/cpus.c
index 3a5323b..dd7e595 100644
--- a/cpus.c
+++ b/cpus.c
@@ -387,6 +387,7 @@ void qtest_clock_warp(int64_t dest)
 seqlock_write_unlock(timers_state.vm_clock_seqlock);
 
 qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
+timerlistgroup_run_timers(qemu_get_aio_context()-tlg);
 clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
 }
 qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
-- 
1.9.3

Re: [Qemu-devel] [RFC PATCH 2/4] pcie-aer: Fix command pcie_aer_inject_error is invalid

2015-01-16 Thread Chen Fan



On 01/12/2015 09:56 PM, Marcel Apfelbaum wrote:

On 01/12/2015 05:04 AM, Chen Fan wrote:

in spec PCI Express 3.0 section 6.2.6 Figure 6-3 virtual bridge part,
the flowchart showing tell us SERR# enable at Bridge Control register
associate with system error at Secondary Status register can send error
message. but bridge_control from dev-config is NULL, and SERR# was set
in dev-wmask in pcie_aer_init()

wmask denotes the register bits that can be written by the guest.

If you are referring to:
   pci_word_test_and_set_mask(dev-wmask + PCI_BRIDGE_CONTROL,
  PCI_BRIDGE_CTL_SERR);
that means that the OS *is able* to turn on/off SERR forwarding.

Hi marcel,

I saw the OS that turn on SERR# is to via evaluate _HPP (Hot Plug 
Parameters) method in ACPI.

is it the only way to turn on SERR#?

Thanks,
Chen




 which was implemented by root port and
swith devices, so we should add wmask (for w/r) bit set for bridge 
control.

we can user command like:
qemu_system_x86_64:
-device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,id=bridge1
-device x3130-upstream,bus=bridge1,id=up.1,addr=00.0
-device xio3130-downstream,bus=up.1,id=down.1,port=1,addr=00.0,chassis=5

(qemu) pcie_aer_inject_error net0 POISON_TLP

after that,
guest can output the error message.

Signed-off-by: Chen Fan chen.fan.f...@cn.fujitsu.com
---
  hw/pci/pcie_aer.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
index 7ca077a..571dc92 100644
--- a/hw/pci/pcie_aer.c
+++ b/hw/pci/pcie_aer.c
@@ -231,7 +231,8 @@ pcie_aer_msg_alldev(PCIDevice *dev, const 
PCIEAERMsg *msg)

   */
  static bool pcie_aer_msg_vbridge(PCIDevice *dev, const PCIEAERMsg 
*msg)

  {
-uint16_t bridge_control = pci_get_word(dev-config + 
PCI_BRIDGE_CONTROL);
Here we check if the Guest OS/firmware actually turned the #SERR 
forwarding on.


+uint16_t bridge_control = pci_get_word(dev-config + 
PCI_BRIDGE_CONTROL) |
+  pci_get_word(dev-wmask + 
PCI_BRIDGE_CONTROL);

I don't think that this check is correct given the above comments.
Please correct me if I mislead you,
Thanks,
Marcel




  if (pcie_aer_msg_is_uncor(msg)) {
  /* Received System Error */



.

Re: [Qemu-devel] [RFC PATCH v7 12/21] replay: recording and replaying clock ticks

2015-01-16 Thread Pavel Dovgaluk

 From: Paolo Bonzini [mailto:pbonz...@redhat.com]
 On 13/01/2015 10:21, Pavel Dovgaluk wrote:
  +/*! Reads next clock event from the input. */
+int64_t replay_read_clock(unsigned int kind)
+{
+if (kind = REPLAY_CLOCK_COUNT) {
+fprintf(stderr, invalid clock ID %d for replay\n, kind);
+exit(1);
+}
+
+replay_exec_instructions();
+
+if (replay_file) {
+if (skip_async_events(EVENT_CLOCK + kind)) {
+replay_read_next_clock(kind);
+}
+int64_t ret = replay_state.cached_clock[kind];
+
+return ret;
+}
+
+fprintf(stderr, REPLAY INTERNAL ERROR %d\n, __LINE__);
+exit(1);
+}
  
   Is this thread safe?
  It is, because order of main_loop and cpu_exec executions is protected
  by global mutex.
 
 
 Please document exactly which globals are protected by the rr QemuMutex,
 and which by the global mutex.  But I think as many variables as
 possible should be protected by the rr QemuMutex, for two reasons:
 
 1) people are working on threaded TCG.  While SMP record/replay is a
 whole different story, even UP TCG will be multithreaded.
 
 2) in the end it makes reviewing the code simpler.

There is one problem with protecting log file inside the replay code.
We probably should have the following sequence of calls:

lock_replay_mutex
replay_save_events
replay_run_event
unlock_replay_mutex

But replay_run_event can also generate some log events and we will have 
deadlock here.
That is why we rely on global mutex.

Pavel Dovgalyuk

Re: [Qemu-devel] global_mutex and multithread.

2015-01-16 Thread Jan Kiszka

On 2015-01-16 08:25, Mark Burton wrote:
 
 On 15 Jan 2015, at 22:41, Paolo Bonzini pbonz...@redhat.com wrote:



 On 15/01/2015 21:53, Mark Burton wrote:
 Jan said he had it working at least on ARM (MusicPal).

 yeah - our problem is when we enable multi-threads - which I dont believe 
 Jan did…

 Multithreaded TCG, or single-threaded TCG with SMP?
 
 He mentions SMP, - I assume thats single-threaded ….

Yes, I didn't patched anything towards multi-threaded SMP. Main reason:
there was no answer on how to emulated the memory models of that target
architecture over the host one which is mandatory if you let the
emulated CPUs run unsynchronized in parallel. Did this change?

 

 One thing I wonder - why do we need to go to the extent of mutexing
 in the TCG like this? Why can’t you simply put a mutex get/release on
 the slow path? If the core is going to do ‘fast path’ access to the
 memory, - even if that memory was IO mapped - would it matter if it
 didn’t have the mutex?

 Because there is no guarantee that the memory map isn't changed by a
 core under the feet of another.  The TLB (in particular the iotlb) is
 only valid with reference to a particular memory map.


 Changes to the memory map certainly happen in the slow path, but lookups
 are part of the fast path.  Even an rwlocks is too slow for a fast path,
 hence the plan of going with RCU.

 Could we arrange the world such that lookups ‘succeed’ (the wheels
 dont fall off) -ether getting the old value, or the new, but not getting
 rubbish - and we still only take the mutex if we are going to make
 alterations to the MM itself? (I have’t looked at the code around that…
 so sorry if the question is ridiculous).

 That's the definition of RCU. :)  Look at the docs in
 http://permalink.gmane.org/gmane.comp.emulators.qemu/313929 for more
 information. :)
 
 Ahh - I see !
 

 It's still not trivial to make it 100% correct, but at the same time
 it's not too hard to prepare something decent to play with.  Also, most
 of the work can be done with KVM so it's more or less independent from
 what you guys have been doing so far.
 
 Yes - the issue is if we end up relying on it.
 But - I see what you mean - these 2 things can ‘dovetail’ together 
 “independently” - so - Jan’s patch will be good for now, and then later we 
 can use RCU to make it work more generally (and more efficiently).
 
 So - our only small problem is getting Jan’s patch to work for multi-thread 
 :-))

See above regarding the potential dimension.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [Qemu-devel] [PATCH v3 3/9] rocker: add register programming guide

2015-01-16 Thread Scott Feldman

On Mon, Jan 12, 2015 at 3:40 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 11/01/2015 04:57, sfel...@gmail.com wrote:
 +PCI Configuration Space
 +---
 +
 +Each switch instance registers as a PCI device with PCI configuration space:
 +
 + offset  width   description value
 + -
 + 0x0 2   Vendor ID   0x1b36
 + 0x2 2   Device ID   0x0006
 + 0x4 4   Command/Status
 + 0x8 1   Revision ID 0x01
 + 0x9 3   Class code  0x2800
 + 0xC 1   Cache line size
 + 0xD 1   Latency timer
 + 0xE 1   Header type
 + 0xF 1   Built-in self test
 + 0x104   Base address low
 + 0x144   Base address high
 + 0x18-28 Reserved
 + 0x2C2   Subsystem vendor ID 0x
 + 0x2E2   Subsystem ID0x

 This should not be guaranteed to 0, should it?

Your're right.  Added a note that subsystem implementation will fill this in.


 + 0x30-38 Reserved
 + 0x3C1   Interrupt line
 + 0x3D1   Interrupt pin   0x00
 + 0x3E1   Min grant   0x00
 + 0x3D1   Max latency 0x00
 + 0x401   TRDY timeout
 + 0x411   Retry count
 + 0x422   Reserved
 +
 +
 +SECTION 3: Memory-Mapped Register Space
 +===
 +
 +There are two memory-mapped BARs.  BAR0 maps device register space and is
 +0x2000 in size.  BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
 +size, allowing for 256 MSI-X vectors.  The host BIOS will assign the base
 +address location.  The host driver/OS will map the base address to host 
 memory,
 +giving the driver mmio access to the device register space.

 No need for the bits after The host BIOS... since that's just normal PCI.

Gone.

 +All registers are 4 or 8 bytes long.  It is assumed host software will 
 access 4
 +byte registers with one 4-byte access, and 8 byte registers with either two
 +4-byte accesses or a single 8-byte access.  In the case of two 4-byte 
 accesses,
 +access must be lower and then upper 4-bytes, in that order.

 Double 4-byte accesses are not implemented, are they?

They are now :)  Tested on i386.  I'll include changes with v4.

 +Interrupt credits
 +^
 +
 +MSI-X vectors used for descriptor ring completions use a credit mechanism 
 for
 +efficient device, PCIe bus, OS and driver operations.  Each descriptor ring 
 has
 +a credit count which represent the number of outstanding descriptors to be
 +processed by the driver.  As the device marks descriptors complete, the 
 credit
 +count is incremented.  As the driver processes those outstanding 
 descriptors,
 +it returns credits back to the device.  This way, the device knows the 
 driver's
 +progress and can make decisions about when to fire the next interrupt or 
 not.
 +When the credit count is zero, and the first descriptors are posted for the
 +driver, a single interrupt is fired.  Once the interrupt is fired, the
 +interrupt is disabled (auto-masked).  In response to the interrupt, the 
 driver
 +will process descriptors and PIO write a returned credit value for that
 +descriptor ring.  If the driver returns all credits (the driver caught up 
 with
 +the device and there is no outstanding work), then the interrupt is 
 unmasked,
 +but not fired.  If only partial credits are returned, the interrupt remains
 +masked but the device generates an interrupt, signaling the driver that more
 +outstanding work is available.

 Perhaps mention that this masking is unrelated to the MSI-X interrupt
 mask register?

Done.

 +SECTION 5: Test Registers
 +=
 +
 +Rocker switch has several test registers to support troubleshooting register

 s/Rocker switch/Rocker/

Done.

 +access, interrupt generation, and DMA operations:
 +
 + TEST_REG, offset 0x0010, 32-bit (R/W)
 + TEST_REG64, offset 0x0018, 64-bit (R/W)
 + TEST_IRQ, offset 0x0020, 32-bit (R/W)
 + TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
 + TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
 + TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)
 +
 +Reads to TEST_REG and TEST_REG64 will read a value 2x the last value 
 written to

 s/2x/equal to twice/

Done.

 +the register.  The 32-bit and 64-bit versions are for testing 32-bit and 
 64-bit
 +host accesses.

 Right now, as mentioned above, 64-bit registers must be accessed with a
 single 32-bit host access.

Fixed in implementation.

 In the case of 32-bit host accesses, should TEST_REG64's value be
 latched until the upper half is written?  If so, please mention it and
 describe that this behavior is shared with the other 64-bit Rocker
 registers.

 +Bits written to TEST_IRQ will cause same (unmasked) bits to be written to
 +IRQ_STAT and an interrupt

Re: [Qemu-devel] [PATCH v7] block/raw-posix.c: Fixes raw_getlength() on Mac OS X so that it reports the correct length of a real CD

2015-01-16 Thread Markus Armbruster

Programmingkid programmingk...@gmail.com writes:

 This patch allows Mac OS X to use a real CDROM disc in QEMU. Testing this 
 patch will require using QEMU v2.2.0 because the current git version has a 
 bug in it that prevents /dev/cdrom from being used. make check did pass and 
 my Debian boot disc did work. 

 Signed-off-by: John Arbuckle programmingk...@gmail.com

Subject line is too long, and the body lacks line breaks.  Suggest
something like:

block/raw-posix.c: Fix raw_getlength() on Mac OS X for CD

This patch allows Mac OS X to use a real CDROM disc in QEMU.
Testing this patch will require using QEMU v2.2.0 because the
current git version has a bug in it that prevents /dev/cdrom from
being used.  make check did pass and my Debian boot disc did work.

Re: [Qemu-devel] global_mutex and multithread.

2015-01-16 Thread Frederic Konrad


On 16/01/2015 09:07, Jan Kiszka wrote:

On 2015-01-16 08:25, Mark Burton wrote:

On 15 Jan 2015, at 22:41, Paolo Bonzini pbonz...@redhat.com wrote:



On 15/01/2015 21:53, Mark Burton wrote:

Jan said he had it working at least on ARM (MusicPal).

yeah - our problem is when we enable multi-threads - which I dont believe Jan 
did…

Multithreaded TCG, or single-threaded TCG with SMP?

He mentions SMP, - I assume thats single-threaded ….

Yes, I didn't patched anything towards multi-threaded SMP. Main reason:
there was no answer on how to emulated the memory models of that target
architecture over the host one which is mandatory if you let the
emulated CPUs run unsynchronized in parallel. Did this change?

Hi Jan,

Actually it's what we are trying to do, running emulated CPUs in parallel.

I get a double mutex_lock error (eg: no unlock) I must have missed something
during the rebase.
I'll check.

Thanks,
Fred


One thing I wonder - why do we need to go to the extent of mutexing
in the TCG like this? Why can’t you simply put a mutex get/release on
the slow path? If the core is going to do ‘fast path’ access to the
memory, - even if that memory was IO mapped - would it matter if it
didn’t have the mutex?

Because there is no guarantee that the memory map isn't changed by a
core under the feet of another.  The TLB (in particular the iotlb) is
only valid with reference to a particular memory map.
Changes to the memory map certainly happen in the slow path, but lookups
are part of the fast path.  Even an rwlocks is too slow for a fast path,
hence the plan of going with RCU.

Could we arrange the world such that lookups ‘succeed’ (the wheels
dont fall off) -ether getting the old value, or the new, but not getting
rubbish - and we still only take the mutex if we are going to make
alterations to the MM itself? (I have’t looked at the code around that…
so sorry if the question is ridiculous).

That's the definition of RCU. :)  Look at the docs in
http://permalink.gmane.org/gmane.comp.emulators.qemu/313929 for more
information. :)

Ahh - I see !


It's still not trivial to make it 100% correct, but at the same time
it's not too hard to prepare something decent to play with.  Also, most
of the work can be done with KVM so it's more or less independent from
what you guys have been doing so far.

Yes - the issue is if we end up relying on it.
But - I see what you mean - these 2 things can ‘dovetail’ together 
“independently” - so - Jan’s patch will be good for now, and then later we can 
use RCU to make it work more generally (and more efficiently).

So - our only small problem is getting Jan’s patch to work for multi-thread :-))

See above regarding the potential dimension.

Jan

[Qemu-devel] [PATCH v5 1/5] qemu-io: Account IO by aio_read and aio_write

2015-01-16 Thread Fam Zheng

This will enable accounting of aio requests issued from qemu-io aio
read/write commands.

Signed-off-by: Fam Zheng f...@redhat.com
---
 qemu-io-cmds.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index e708552..29377cd 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -13,6 +13,7 @@
 #include block/qapi.h
 #include qemu/main-loop.h
 #include qemu/timer.h
+#include sysemu/block-backend.h
 
 #define CMD_NOFILE_OK   0x01
 
@@ -1337,6 +1338,7 @@ out:
 }
 
 struct aio_ctx {
+BlockDriverState *bs;
 QEMUIOVector qiov;
 int64_t offset;
 char *buf;
@@ -1344,6 +1346,7 @@ struct aio_ctx {
 int vflag;
 int Cflag;
 int Pflag;
+BlockAcctCookie acct;
 int pattern;
 struct timeval t1;
 };
@@ -1361,6 +1364,8 @@ static void aio_write_done(void *opaque, int ret)
 goto out;
 }
 
+block_acct_done(ctx-bs-stats, ctx-acct);
+
 if (ctx-qflag) {
 goto out;
 }
@@ -1398,6 +1403,8 @@ static void aio_read_done(void *opaque, int ret)
 g_free(cmp_buf);
 }
 
+block_acct_done(ctx-bs-stats, ctx-acct);
+
 if (ctx-qflag) {
 goto out;
 }
@@ -1453,6 +1460,7 @@ static int aio_read_f(BlockDriverState *bs, int argc, 
char **argv)
 int nr_iov, c;
 struct aio_ctx *ctx = g_new0(struct aio_ctx, 1);
 
+ctx-bs = bs;
 while ((c = getopt(argc, argv, CP:qv)) != EOF) {
 switch (c) {
 case 'C':
@@ -1506,6 +1514,7 @@ static int aio_read_f(BlockDriverState *bs, int argc, 
char **argv)
 }
 
 gettimeofday(ctx-t1, NULL);
+block_acct_start(bs-stats, ctx-acct, ctx-qiov.size, BLOCK_ACCT_READ);
 bdrv_aio_readv(bs, ctx-offset  9, ctx-qiov,
ctx-qiov.size  9, aio_read_done, ctx);
 return 0;
@@ -1549,6 +1558,7 @@ static int aio_write_f(BlockDriverState *bs, int argc, 
char **argv)
 int pattern = 0xcd;
 struct aio_ctx *ctx = g_new0(struct aio_ctx, 1);
 
+ctx-bs = bs;
 while ((c = getopt(argc, argv, CqP:)) != EOF) {
 switch (c) {
 case 'C':
@@ -1598,6 +1608,7 @@ static int aio_write_f(BlockDriverState *bs, int argc, 
char **argv)
 }
 
 gettimeofday(ctx-t1, NULL);
+block_acct_start(bs-stats, ctx-acct, ctx-qiov.size, BLOCK_ACCT_WRITE);
 bdrv_aio_writev(bs, ctx-offset  9, ctx-qiov,
 ctx-qiov.size  9, aio_write_done, ctx);
 return 0;
-- 
1.9.3

[Qemu-devel] [PATCH v5 0/5] block: Add a qemu-iotests case for IO throttling

2015-01-16 Thread Fam Zheng

v5: Rebase and improve the test. Please review again.

Patch dependencies:

This test depends on the qtest timer fix to run correctly.
http://lists.gnu.org/archive/html/qemu-devel/2015-01/msg01865.html

Also depends on the os check fix to run at all:
http://lists.gnu.org/archive/html/qemu-devel/2015-01/msg01848.html

Original cover letter
-

There is a change in qemu-io sub-commands aio_read and aio_write, which
makes the aio requests accounted and the statistics reflected in blockstats.

Note that IO throttling implementation allows overcommiting of requests, so the
actual IO happened in a time unit may be a bit larger than given limits. In the
test case, the stats numbers are compared with a 10% error tolerance, to make
room for such flexibility in order to improve determinism.

Fam


Fam Zheng (5):
  qemu-io: Account IO by aio_read and aio_write
  qtest: Add scripts/qtest.py
  qemu-iotests: Add VM method qtest() to iotests.py
  qemu-iotests: Allow caller to disable underscore convertion for qmp
  qemu-iotests: Add 093 for IO throttling

 qemu-io-cmds.c|  11 +
 scripts/qtest.py  |  71 +
 tests/qemu-iotests/093| 103 ++
 tests/qemu-iotests/093.out|   5 ++
 tests/qemu-iotests/group  |   1 +
 tests/qemu-iotests/iotests.py |  23 --
 6 files changed, 210 insertions(+), 4 deletions(-)
 create mode 100644 scripts/qtest.py
 create mode 100755 tests/qemu-iotests/093
 create mode 100644 tests/qemu-iotests/093.out

-- 
1.9.3

[Qemu-devel] [PATCH v5 2/5] qtest: Add scripts/qtest.py

2015-01-16 Thread Fam Zheng

This adds scripts/qtest.py as a python library for qtest protocol.

This is a skeleton with a basic cmd method to execute a command,
reading and parsing of qtest output could be added later on demand.

Signed-off-by: Fam Zheng f...@redhat.com
---
 scripts/qtest.py | 71 
 1 file changed, 71 insertions(+)
 create mode 100644 scripts/qtest.py

diff --git a/scripts/qtest.py b/scripts/qtest.py
new file mode 100644
index 000..25391f5
--- /dev/null
+++ b/scripts/qtest.py
@@ -0,0 +1,71 @@
+# QEMU qtest library
+#
+# Copyright (C) 2015 Red Hat Inc.
+#
+# Authors:
+#  Fam Zheng f...@redhat.com
+#
+# This work is licensed under the terms of the GNU GPL, version 2.  See
+# the COPYING file in the top-level directory.
+#
+# Based on qmp.py.
+#
+
+import errno
+import socket
+
+class QEMUQtestProtocol(object):
+def __init__(self, address, server=False):
+
+Create a QEMUQtestProtocol object.
+
+@param address: QEMU address, can be either a unix socket path (string)
+or a tuple in the form ( address, port ) for a TCP
+connection
+@param server: server mode listens on the socket (bool)
+@raise socket.error on socket connection errors
+@note No connection is established, this is done by the connect() or
+  accept() methods
+
+self._address = address
+self._sock = self._get_sock()
+if server:
+self._sock.bind(self._address)
+self._sock.listen(1)
+
+def _get_sock(self):
+if isinstance(self._address, tuple):
+family = socket.AF_INET
+else:
+family = socket.AF_UNIX
+return socket.socket(family, socket.SOCK_STREAM)
+
+def connect(self):
+
+Connect to the qtest socket.
+
+@raise socket.error on socket connection errors
+
+self._sock.connect(self._address)
+
+def accept(self):
+
+Await connection from QEMU.
+
+@raise socket.error on socket connection errors
+
+self._sock, _ = self._sock.accept()
+
+def cmd(self, qtest_cmd):
+
+Send a qtest command on the wire.
+
+@param qtest_cmd: qtest command text to be sent
+
+self._sock.sendall(qtest_cmd + \n)
+
+def close(self):
+self._sock.close()
+
+def settimeout(self, timeout):
+self._sock.settimeout(timeout)
-- 
1.9.3

[Qemu-devel] [PATCH v5 4/5] qemu-iotests: Allow caller to disable underscore convertion for qmp

2015-01-16 Thread Fam Zheng

QMP command block_set_io_throttle expects underscores in parameters
instead of dashes: {iops,bps}_{rd,wr,max}.

Add optional argument conv_keys (defaults to True, backward compatible),
it will be used in IO throttling test case.

Reviewed-by: Benoit Canet ben...@irqsave.net
Signed-off-by: Fam Zheng f...@redhat.com
---
 tests/qemu-iotests/iotests.py | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 85cb9a5..1402854 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -185,11 +185,14 @@ class VM(object):
 self._popen = None
 
 underscore_to_dash = string.maketrans('_', '-')
-def qmp(self, cmd, **args):
+def qmp(self, cmd, conv_keys=True, **args):
 '''Invoke a QMP command and return the result dict'''
 qmp_args = dict()
 for k in args.keys():
-qmp_args[k.translate(self.underscore_to_dash)] = args[k]
+if conv_keys:
+qmp_args[k.translate(self.underscore_to_dash)] = args[k]
+else:
+qmp_args[k] = args[k]
 
 return self._qmp.cmd(cmd, args=qmp_args)
 
-- 
1.9.3

[Qemu-devel] [PATCH v5 5/5] qemu-iotests: Add 093 for IO throttling

2015-01-16 Thread Fam Zheng

This case utilizes qemu-io command aio_{read,write} -q to verify the
effectiveness of IO throttling options.

It's implemented by driving the vm timer from qtest protocol, so the
throttling timers are signaled with determinied time duration. Then we
verify the completed IO requests are within 10% error of bps and iops
limits.

null protocol is used as the disk backend so that no actual disk IO is
performed on host, this will make the blockstats much more
deterministic. Both null-aio and null-co are covered, which is also
a simple cross validation test for the driver code.

Signed-off-by: Fam Zheng f...@redhat.com
---
 tests/qemu-iotests/093 | 103 +
 tests/qemu-iotests/093.out |   5 +++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 109 insertions(+)
 create mode 100755 tests/qemu-iotests/093
 create mode 100644 tests/qemu-iotests/093.out

diff --git a/tests/qemu-iotests/093 b/tests/qemu-iotests/093
new file mode 100755
index 000..d12cc25
--- /dev/null
+++ b/tests/qemu-iotests/093
@@ -0,0 +1,103 @@
+#!/usr/bin/env python
+#
+# Tests for IO throttling
+#
+# Copyright (C) 2015 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see http://www.gnu.org/licenses/.
+#
+
+import iotests
+
+class ThrottleTestCase(iotests.QMPTestCase):
+test_img = null-aio://
+
+def blockstats(self, device):
+result = self.vm.qmp(query-blockstats)
+for r in result['return']:
+if r['device'] == device:
+stat = r['stats']
+return stat['rd_bytes'], stat['rd_operations'], 
stat['wr_bytes'], stat['wr_operations']
+raise Exception(Device not found for blockstats: %s % device)
+
+def setUp(self):
+self.vm = iotests.VM().add_drive(self.test_img)
+self.vm.launch()
+
+def tearDown(self):
+self.vm.shutdown()
+
+def do_test_throttle(self, seconds, params):
+def check_limit(limit, num):
+# IO throttling algorithm is discrete, allow 10% error so the test
+# is more
+return limit == 0 or \
+   (num  seconds * limit * 1.1
+   and num  seconds * limit * 0.9)
+
+nsec_per_sec = 10
+
+params['device'] = 'drive0'
+params['bps_max'] = 1
+params['iops_max'] = 1
+
+result = self.vm.qmp(block_set_io_throttle, conv_keys=False, 
**params)
+self.assert_qmp(result, 'return', {})
+
+# Set vm clock to a known value
+ns = seconds * nsec_per_sec
+self.vm.qtest(clock_step %d % ns)
+
+# Append many requests into the throttle queue. They will drain bps_max
+# and iops_max, but the rest requests won't get executed until we
+# advance the virtual clock with qtest interface
+for i in range(1000):
+self.vm.hmp_qemu_io(drive0, aio_read %d 4096 % (i * 512))
+self.vm.hmp_qemu_io(drive0, aio_write %d 4096 % (i * 512))
+
+start_rd_bytes, start_rd_iops, start_wr_bytes, start_wr_iops = 
self.blockstats('drive0')
+
+self.vm.qtest(clock_step %d % ns)
+end_rd_bytes, end_rd_iops, end_wr_bytes, end_wr_iops = 
self.blockstats('drive0')
+
+rd_bytes = end_rd_bytes - start_rd_bytes
+rd_iops = end_rd_iops - start_rd_iops
+wr_bytes = end_wr_bytes - start_wr_bytes
+wr_iops = end_wr_iops - start_wr_iops
+
+self.assertTrue(check_limit(params['bps'], rd_bytes + wr_bytes))
+self.assertTrue(check_limit(params['bps_rd'], rd_bytes))
+self.assertTrue(check_limit(params['bps_wr'], wr_bytes))
+self.assertTrue(check_limit(params['iops'], rd_iops + wr_iops))
+self.assertTrue(check_limit(params['iops_rd'], rd_iops))
+self.assertTrue(check_limit(params['iops_wr'], wr_iops))
+
+def test_all(self):
+params = {bps: 128000,
+  bps_rd: 128000,
+  bps_wr: 128000,
+  iops: 64,
+  iops_rd: 64,
+  iops_wr: 64
+ }
+for tk in params:
+limits = dict([(k, 0) for k in params])
+limits[tk] = params[tk]
+self.do_test_throttle(10, limits)
+
+class ThrottleTestCoroutine(ThrottleTestCase):
+test_img = null-co://
+
+if __name__ == '__main__':
+iotests.main(supported_fmts=[raw])
diff --git a/tests/qemu-iotests/093.out

[Qemu-devel] [PATCH v5 3/5] qemu-iotests: Add VM method qtest() to iotests.py

2015-01-16 Thread Fam Zheng

This will allow test case to run command in qtest protocol. It's
write-only for now.

Signed-off-by: Fam Zheng f...@redhat.com
---
 tests/qemu-iotests/iotests.py | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 241b5ee..85cb9a5 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -21,8 +21,11 @@ import re
 import subprocess
 import string
 import unittest
-import sys; sys.path.append(os.path.join(os.path.dirname(__file__), '..', 
'..', 'scripts', 'qmp'))
+import sys
+sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..', 'scripts'))
+sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..', 'scripts', 
'qmp'))
 import qmp
+import qtest
 import struct
 
 __all__ = ['imgfmt', 'imgproto', 'test_dir' 'qemu_img', 'qemu_io',
@@ -81,10 +84,12 @@ class VM(object):
 def __init__(self):
 self._monitor_path = os.path.join(test_dir, 'qemu-mon.%d' % 
os.getpid())
 self._qemu_log_path = os.path.join(test_dir, 'qemu-log.%d' % 
os.getpid())
+self._qtest_path = os.path.join(test_dir, 'qemu-qtest.%d' % 
os.getpid())
 self._args = qemu_args + ['-chardev',
  'socket,id=mon,path=' + self._monitor_path,
  '-mon', 'chardev=mon,mode=control',
- '-qtest', 'stdio', '-machine', 'accel=qtest',
+ '-qtest', 'unix:path=' + self._qtest_path,
+ '-machine', 'accel=qtest',
  '-display', 'none', '-vga', 'none']
 self._num_drives = 0
 
@@ -160,9 +165,11 @@ class VM(object):
 qemulog = open(self._qemu_log_path, 'wb')
 try:
 self._qmp = qmp.QEMUMonitorProtocol(self._monitor_path, 
server=True)
+self._qtest = qtest.QEMUQtestProtocol(self._qtest_path, 
server=True)
 self._popen = subprocess.Popen(self._args, stdin=devnull, 
stdout=qemulog,
stderr=subprocess.STDOUT)
 self._qmp.accept()
+self._qtest.accept()
 except:
 os.remove(self._monitor_path)
 raise
@@ -173,6 +180,7 @@ class VM(object):
 self._qmp.cmd('quit')
 self._popen.wait()
 os.remove(self._monitor_path)
+os.remove(self._qtest_path)
 os.remove(self._qemu_log_path)
 self._popen = None
 
@@ -185,6 +193,10 @@ class VM(object):
 
 return self._qmp.cmd(cmd, args=qmp_args)
 
+def qtest(self, cmd):
+'''Send a qtest command to guest'''
+return self._qtest.cmd(cmd)
+
 def get_qmp_event(self, wait=False):
 '''Poll for one queued QMP events and return it'''
 return self._qmp.pull_event(wait=wait)
-- 
1.9.3

[Qemu-devel] [PATCH v4 02/10] virtio-net: use qemu_mac_strdup_printf

2015-01-16 Thread sfeldma

From: Scott Feldman sfel...@gmail.com

Signed-off-by: Scott Feldman sfel...@gmail.com
---
 hw/net/virtio-net.c |   12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 45da34a..698156f 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -226,12 +226,6 @@ static void rxfilter_notify(NetClientState *nc)
 }
 }
 
-static char *mac_strdup_printf(const uint8_t *mac)
-{
-return g_strdup_printf(%.2x:%.2x:%.2x:%.2x:%.2x:%.2x, mac[0],
-mac[1], mac[2], mac[3], mac[4], mac[5]);
-}
-
 static intList *get_vlan_table(VirtIONet *n)
 {
 intList *list, *entry;
@@ -284,12 +278,12 @@ static RxFilterInfo 
*virtio_net_query_rxfilter(NetClientState *nc)
 info-multicast_overflow = n-mac_table.multi_overflow;
 info-unicast_overflow = n-mac_table.uni_overflow;
 
-info-main_mac = mac_strdup_printf(n-mac);
+info-main_mac = qemu_mac_strdup_printf(n-mac);
 
 str_list = NULL;
 for (i = 0; i  n-mac_table.first_multi; i++) {
 entry = g_malloc0(sizeof(*entry));
-entry-value = mac_strdup_printf(n-mac_table.macs + i * ETH_ALEN);
+entry-value = qemu_mac_strdup_printf(n-mac_table.macs + i * 
ETH_ALEN);
 entry-next = str_list;
 str_list = entry;
 }
@@ -298,7 +292,7 @@ static RxFilterInfo 
*virtio_net_query_rxfilter(NetClientState *nc)
 str_list = NULL;
 for (i = n-mac_table.first_multi; i  n-mac_table.in_use; i++) {
 entry = g_malloc0(sizeof(*entry));
-entry-value = mac_strdup_printf(n-mac_table.macs + i * ETH_ALEN);
+entry-value = qemu_mac_strdup_printf(n-mac_table.macs + i * 
ETH_ALEN);
 entry-next = str_list;
 str_list = entry;
 }
-- 
1.7.10.4

[Qemu-devel] [PATCH v4 00/10] rocker: add new rocker ethernet switch device

2015-01-16 Thread sfeldma

From: Scott Feldman sfel...@gmail.com

v4:

 - Per Paolo Bonzini review:
 - move reg_guide.txt to docs/specs/rocker.txt
 - fix some spelling/grammer mistakes in the rocker.txt doc
 - fix some misleading/wrong statements in rocker.txt
 - add double 4-byte access for 64-bit registers
 - define new ROCKER_Exxx to replace usage of errno.h return codes
 - Add patch from David Ahern to timestamp debug output

v3:

 - Per Stefan Hajnoczi review:
 - move HMP rocker cmds to info rocker
 - prefix QMP rocker cmds with query-
 - tag QMP cmds as Since 2.3
 - convert structs to typedef with CamelCase naming
 - Remove SDHCI device ID move patch...Paolo Bonzini is addressing
   SDHCI move is separate patch.

v2:

 - Address reg_guide.txt review comments from Eric Blake
 - Address QMP review comments from Eric Blake

v1:

[This is a collaboration between myself and Jiri Pirko].

This patch set adds a new ethernet switch device, called rocker.  Rocker is
intended to emulate HW features of switch ASICs found in today's
data-center-class switch/routers.  The original motivation in creating a new
device is to accelerate device driver development for ethernet switches in the
Linux kernel.  A device driver for rocker already exists in the Linux 3.18
kernel and loads against this device.  Basic L2 switching (bridging)
functionality is offloaded to the device.  Work continues to enable offloading
of L3 routing functions and ACLs, as well as support for a flow-based modes,
such as OpenVSwitch with OpenFlow.  Future support for terminating L2-over-L3
tunnels is also planned.

The core network processing functions are based on the spec of a real device:
Broadcom's OF-DPA.  Specifically, rocker borrows OF-DPA's network processing
pipeline comprised of flow match and action tables.  Only the OF-DPA spec was
used in constructing rocker.  The rocker developers do not have access to the
real OF-DPA's software source code, so this is a clean-room, ground-up
development.

Each rocker device is a PCI device with a memory-mapped register space and
MSI-X interrupts for command and event processing, as well as CPU-bound I/O.
Each device can support up to 62 front-panel ports, which present themselves
as -netdev attachment points.  The device is programmed using OF-DPA flow and
group tables to setup the flow pipeline.  The programming defines the
forwarding path for packets ingressing on 'front-panel' ports.  The forwarding
path can look at L2/L3/L4 packet header to forward the packet to its
destination.  For the performance path, packets would ingress and egress only
on the device, and not be passed up to the device driver (or host OS).  The
slow path for control packets will forward packets to the CPU via the device
driver for host OS processing.

A QMP/HMP interface is added to give inside into the device's internal port
configuration and flow/group tables.

A test directory is included with some basic sanity tests to verify the device
and driver.

David Ahern (1):
  rocker: timestamp on the debug logs helps correlate with events in
the VM

Scott Feldman (9):
  net: add MAC address string printer
  virtio-net: use qemu_mac_strdup_printf
  rocker: add register programming guide
  pci: add rocker device ID
  pci: add network device class 'other' for network switches
  rocker: add new rocker switch device
  qmp: add rocker device support
  rocker: add tests
  MAINTAINERS: add rocker

 MAINTAINERS|6 +
 default-configs/pci.mak|1 +
 docs/specs/pci-ids.txt |1 +
 docs/specs/rocker.txt  |  960 +
 hmp-commands.hx|   24 +
 hmp.c  |  303 +
 hmp.h  |4 +
 hw/net/Makefile.objs   |3 +
 hw/net/rocker/rocker.c | 1511 +
 hw/net/rocker/rocker.h |   85 ++
 hw/net/rocker/rocker_desc.c|  379 ++
 hw/net/rocker/rocker_desc.h|   57 +
 hw/net/rocker/rocker_fp.c  |  242 
 hw/net/rocker/rocker_fp.h  |   54 +
 hw/net/rocker/rocker_hw.h  |  491 +++
 hw/net/rocker/rocker_of_dpa.c  | 2622 
 hw/net/rocker/rocker_of_dpa.h  |   25 +
 hw/net/rocker/rocker_tlv.h |  244 
 hw/net/rocker/rocker_world.c   |  108 ++
 hw/net/rocker/rocker_world.h   |   63 +
 hw/net/rocker/test/README  |5 +
 hw/net/rocker/test/all |   19 +
 hw/net/rocker/test/bridge  |   43 +
 hw/net/rocker/test/bridge-stp  |   52 +
 hw/net/rocker/test/bridge-vlan |   52 +
 hw/net/rocker/test/bridge-vlan-stp |   64 +
 hw/net/rocker/test/port|   22 +
 hw/net/rocker/test/tut.dot |8 +
 hw/net/virtio-net.c|   12 +-
 include/hw/pci/pci.h   |1 +
 include/hw/pci/pci_ids.h   |1 +
 include/net/net.h  |1 +
 monitor.c

[Qemu-devel] [PATCH v4 08/10] rocker: add tests

2015-01-16 Thread sfeldma

From: Scott Feldman sfel...@gmail.com

Add some basic test for rocker to test L2/L3/L4 functionality.  Requires an
external test environment, simp, located here:

https://github.com/scottfeldman/simp

To run tests, simp environment must be installed and a suitable VM image built
and installed with a Linux 3.18 (or greater) kernel with rocker driver support
enabled.

Signed-off-by: Scott Feldman sfel...@gmail.com
---
 hw/net/rocker/test/README  |5 +++
 hw/net/rocker/test/all |   19 +++
 hw/net/rocker/test/bridge  |   43 
 hw/net/rocker/test/bridge-stp  |   52 +
 hw/net/rocker/test/bridge-vlan |   52 +
 hw/net/rocker/test/bridge-vlan-stp |   64 
 hw/net/rocker/test/port|   22 +
 hw/net/rocker/test/tut.dot |8 +
 8 files changed, 265 insertions(+)
 create mode 100644 hw/net/rocker/test/README
 create mode 100755 hw/net/rocker/test/all
 create mode 100755 hw/net/rocker/test/bridge
 create mode 100755 hw/net/rocker/test/bridge-stp
 create mode 100755 hw/net/rocker/test/bridge-vlan
 create mode 100755 hw/net/rocker/test/bridge-vlan-stp
 create mode 100755 hw/net/rocker/test/port
 create mode 100644 hw/net/rocker/test/tut.dot

diff --git a/hw/net/rocker/test/README b/hw/net/rocker/test/README
new file mode 100644
index 000..531e673
--- /dev/null
+++ b/hw/net/rocker/test/README
@@ -0,0 +1,5 @@
+Tests require simp (simple network simulator) found here:
+
+https://github.com/scottfeldman/simp
+
+Run 'all' to run all tests.
diff --git a/hw/net/rocker/test/all b/hw/net/rocker/test/all
new file mode 100755
index 000..d5ae963
--- /dev/null
+++ b/hw/net/rocker/test/all
@@ -0,0 +1,19 @@
+echo -n Running port test...  
+./port
+if [ $? -eq 0 ]; then echo pass; else echo FAILED; exit 1; fi
+
+echo -n Running bridge test...
+./bridge
+if [ $? -eq 0 ]; then echo pass; else echo FAILED; exit 1; fi
+
+echo -n Running bridge STP test...
+./bridge-stp
+if [ $? -eq 0 ]; then echo pass; else echo FAILED; exit 1; fi
+
+echo -n Running bridge VLAN test...   
+./bridge-vlan
+if [ $? -eq 0 ]; then echo pass; else echo FAILED; exit 1; fi
+
+echo -n Running bridge VLAN STP test...   
+./bridge-vlan-stp
+if [ $? -eq 0 ]; then echo pass; else echo FAILED; exit 1; fi
diff --git a/hw/net/rocker/test/bridge b/hw/net/rocker/test/bridge
new file mode 100755
index 000..f1f555a
--- /dev/null
+++ b/hw/net/rocker/test/bridge
@@ -0,0 +1,43 @@
+simp destroy .*
+simp create -o sw1:rocker:sw1 tut tut.dot
+simp start tut
+sleep 10
+while ! simp ssh tut sw1 --cmd ping -c 1 localhost /dev/null; do sleep 1; 
done
+while ! simp ssh tut h1 --cmd ping -c 1 localhost /dev/null; do sleep 1; 
done
+while ! simp ssh tut h2 --cmd ping -c 1 localhost /dev/null; do sleep 1; 
done
+
+# configure a 2-port bridge
+
+simp ssh tut sw1 --cmd sudo /sbin/ip link add name br0 type bridge
+simp ssh tut sw1 --cmd sudo /sbin/ip link set dev swp1 master br0
+simp ssh tut sw1 --cmd sudo /sbin/ip link set dev swp2 master br0
+
+# turn off vlan default_pvid on br0
+
+simp ssh tut sw1 --cmd echo 0 | sudo dd 
of=/sys/class/net/br0/bridge/default_pvid 2 /dev/null
+
+# turn off learning and flooding in SW
+
+simp ssh tut sw1 --cmd sudo /sbin/bridge link set dev swp1 learning off
+simp ssh tut sw1 --cmd sudo /sbin/bridge link set dev swp2 learning off
+
+simp ssh tut sw1 --cmd sudo /sbin/bridge link set dev swp1 flood off
+simp ssh tut sw1 --cmd sudo /sbin/bridge link set dev swp2 flood off
+
+# bring up bridge and ports
+
+simp ssh tut sw1 --cmd sudo ifconfig br0 up
+simp ssh tut sw1 --cmd sudo ifconfig swp1 up
+simp ssh tut sw1 --cmd sudo ifconfig swp2 up
+simp ssh tut sw1 --cmd sudo ifconfig br0 11.0.0.3/24
+
+# config IP on hosts
+
+simp ssh tut h1 --cmd sudo ifconfig swp1 11.0.0.1/24
+simp ssh tut h2 --cmd sudo ifconfig swp1 11.0.0.2/24
+
+# test...
+
+simp ssh tut h1 --cmd ping -c10 11.0.0.2 /dev/null
+if [ $? -ne 0 ]; then exit 1; fi
+simp ssh tut h1 --cmd ping -c10 11.0.0.3 /dev/null
diff --git a/hw/net/rocker/test/bridge-stp b/hw/net/rocker/test/bridge-stp
new file mode 100755
index 000..60f4483
--- /dev/null
+++ b/hw/net/rocker/test/bridge-stp
@@ -0,0 +1,52 @@
+simp destroy .*
+simp create -o sw1:rocker:sw1 tut tut.dot
+simp start tut
+sleep 10
+while ! simp ssh tut sw1 --cmd ping -c 1 localhost /dev/null; do sleep 1; 
done
+while ! simp ssh tut h1 --cmd ping -c 1 localhost /dev/null; do sleep 1; 
done
+while ! simp ssh tut h2 --cmd ping -c 1 localhost /dev/null; do sleep 1; 
done
+
+# configure a 2-port bridge
+
+simp ssh tut sw1 --cmd sudo /sbin/ip link add name br0 type bridge
+simp ssh tut sw1 --cmd sudo brctl stp br0 on
+simp ssh tut sw1 --cmd sudo /sbin/ip link set dev swp1 master br0
+simp ssh tut sw1 --cmd sudo /sbin/ip link set dev swp2 master br0
+
+# turn off vlan default_pvid on br0
+
+simp ssh tut sw1 --cmd echo

[Qemu-devel] [PATCH v4 07/10] qmp: add rocker device support

2015-01-16 Thread sfeldma

From: Scott Feldman sfel...@gmail.com

Add QMP/HMP support for rocker devices.  This is mostly for debugging purposes
to see inside the device's tables and port configurations.  Some examples:

(qemu) info rocker sw1
name: sw1
id: 0x013512005452
ports: 4

(qemu) info rocker-ports sw1
ena/speed/ auto
  port  linkduplex neg?
 sw1.1  up 10G  FD  No
 sw1.2  up 10G  FD  No
 sw1.3  !ena   10G  FD  No
 sw1.4  !ena   10G  FD  No

(qemu) info rocker-of-dpa-flows sw1
prio tbl hits key(mask) -- actions
260   lport 1 vlan 1 LLDP src 00:02:00:00:02:00 dst 01:80:c2:00:00:0e
260   lport 1 vlan 1 ARP src 00:02:00:00:02:00 dst 00:02:00:00:03:00
260   lport 2 vlan 2 IPv6 src 00:02:00:00:03:00 dst 33:33:ff:00:00:02 
proto 58
350   vlan 2 dst 33:33:ff:00:00:02 -- write group 0x3201 goto tbl 
60
260   lport 2 vlan 2 IPv6 src 00:02:00:00:03:00 dst 33:33:ff:00:03:00 
proto 58
350  1vlan 2 dst 33:33:ff:00:03:00 -- write group 0x3201 goto tbl 
60
260   lport 2 vlan 2 ARP src 00:02:00:00:03:00 dst 00:02:00:00:02:00
350  2vlan 2 dst 00:02:00:00:02:00 -- write group 0x0201 goto tbl 
60
260  1lport 2 vlan 2 IP src 00:02:00:00:03:00 dst 00:02:00:00:02:00 
proto 1
350  2vlan 1 dst 00:02:00:00:03:00 -- write group 0x0102 goto tbl 
60
260  1lport 1 vlan 1 IP src 00:02:00:00:02:00 dst 00:02:00:00:03:00 
proto 1
260   lport 1 vlan 1 IPv6 src 00:02:00:00:02:00 dst 33:33:ff:00:00:01 
proto 58
350   vlan 1 dst 33:33:ff:00:00:01 -- write group 0x3100 goto tbl 
60
260   lport 1 vlan 1 IPv6 src 00:02:00:00:02:00 dst 33:33:ff:00:02:00 
proto 58
350  1vlan 1 dst 33:33:ff:00:02:00 -- write group 0x3100 goto tbl 
60
160  173  lport 2 vlan 2 LLDP src any dst 01:80:c2:00:00:0e -- write 
group 0x0200
160  6lport 2 vlan 2 IPv6 src any dst any -- write group 0x0200
160  174  lport 1 vlan 1 LLDP src any dst 01:80:c2:00:00:0e -- write 
group 0x0100
160  174  lport 2 vlan 2 IP src any dst any -- write group 0x0200
160  6lport 1 vlan 1 IPv6 src any dst any -- write group 0x0100
160  181  lport 2 vlan 2 ARP src any dst any -- write group 0x0200
110  715  lport 2 -- apply new vlan 2 goto tbl 20
160  177  lport 1 vlan 1 ARP src any dst any -- write group 0x0100
160  174  lport 1 vlan 1 IP src any dst any -- write group 0x0100
110  717  lport 1 -- apply new vlan 1 goto tbl 20
10   1432 lport 0(0x) -- goto tbl 10

(qemu) info rocker-of-dpa-groups sw1
id (decode) -- buckets
0x3201 (type L2 multicast vlan 2 index 1) -- groups [0x0201,0x0200]
0x0201 (type L2 interface vlan 2 lport 1) -- pop vlan out lport 1
0x0102 (type L2 interface vlan 1 lport 2) -- pop vlan out lport 2
0x0200 (type L2 interface vlan 2 lport 0) -- pop vlan out lport 0
0x0100 (type L2 interface vlan 1 lport 0) -- pop vlan out lport 0
0x3100 (type L2 multicast vlan 1 index 0) -- groups [0x0102,0x0100]

Signed-off-by: Scott Feldman sfel...@gmail.com
Signed-off-by: Jiri Pirko j...@resnulli.us
---
 hmp-commands.hx   |   24 
 hmp.c |  303 
 hmp.h |4 +
 hw/net/rocker/rocker.c|   45 ++
 hw/net/rocker/rocker_fp.c |   10 ++
 hw/net/rocker/rocker_fp.h |1 +
 hw/net/rocker/rocker_of_dpa.c |  308 +
 monitor.c |   28 
 qapi-schema.json  |3 +
 qapi/rocker.json  |  259 ++
 qmp-commands.hx   |   97 +
 11 files changed, 1082 insertions(+)
 create mode 100644 qapi/rocker.json

diff --git a/hmp-commands.hx b/hmp-commands.hx
index e37bc8b..6873a3a 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1787,5 +1787,29 @@ show available trace events and their state
 ETEXI
 
 STEXI
+@item rocker @var{name}
+@findex rocker
+Show Rocker(s)
+ETEXI
+
+STEXI
+@item rocker_ports @var{name}
+@findex rocker_ports
+Show Rocker ports
+ETEXI
+
+STEXI
+@item rocker_of_dpa_flows @var{name} [@var{tbl_id}]
+@findex rocker_of_dpa_flows
+Show Rocker OF-DPA flow tables
+ETEXI
+
+STEXI
+@item rocker_of_dpa_groups @var{name} [@var{type}]
+@findex rocker_of_dpa_groups
+Show Rocker OF-DPA groups
+ETEXI
+
+STEXI
 @end table
 ETEXI
diff --git a/hmp.c b/hmp.c
index 481be80..949d43c 100644
--- a/hmp.c
+++ b/hmp.c
@@ -15,6 +15,7 @@
 
 #include hmp.h
 #include net/net.h
+#include net/eth.h
 #include sysemu/char.h
 #include qemu/option.h
 #include qemu/timer.h
@@ -1813,3 +1814,305 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
 
 qapi_free_MemoryDeviceInfoList(info_list);
 }
+
+void hmp_rocker(Monitor *mon, const QDict *qdict)
+{
+const char *name = qdict_get_str(qdict, name);
+RockerSwitch *rocker;
+Error *errp = NULL;
+

[Qemu-devel] [PATCH v4 04/10] pci: add rocker device ID

2015-01-16 Thread sfeldma

From: Scott Feldman sfel...@gmail.com

Signed-off-by: Scott Feldman sfel...@gmail.com
Signed-off-by: Jiri Pirko j...@resnulli.us
---
 docs/specs/pci-ids.txt |1 +
 include/hw/pci/pci.h   |1 +
 2 files changed, 2 insertions(+)

diff --git a/docs/specs/pci-ids.txt b/docs/specs/pci-ids.txt
index c6732fe..e4a4490 100644
--- a/docs/specs/pci-ids.txt
+++ b/docs/specs/pci-ids.txt
@@ -45,6 +45,7 @@ PCI devices (other than virtio):
 1b36:0003  PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt)
 1b36:0004  PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt)
 1b36:0005  PCI test device (docs/specs/pci-testdev.txt)
+1b36:0006  PCI Rocker Ethernet switch device
 1b36:0007  PCI SD Card Host Controller Interface (SDHCI)
 
 All these devices are documented in docs/specs.
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 97a83d3..2c58bf2 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -88,6 +88,7 @@
 #define PCI_DEVICE_ID_REDHAT_SERIAL2 0x0003
 #define PCI_DEVICE_ID_REDHAT_SERIAL4 0x0004
 #define PCI_DEVICE_ID_REDHAT_TEST0x0005
+#define PCI_DEVICE_ID_REDHAT_ROCKER  0x0006
 #define PCI_DEVICE_ID_REDHAT_SDHCI   0x0007
 #define PCI_DEVICE_ID_REDHAT_QXL 0x0100
 
-- 
1.7.10.4

[Qemu-devel] [PATCH v4 01/10] net: add MAC address string printer

2015-01-16 Thread sfeldma

From: Scott Feldman sfel...@gmail.com

We can use this in virtio-net code as well as new Rocker driver code, so
up-level this.

Signed-off-by: Scott Feldman sfel...@gmail.com
---
 include/net/net.h |1 +
 net/net.c |7 +++
 2 files changed, 8 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 008d610..90742f0 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -97,6 +97,7 @@ typedef struct NICState {
 bool peer_deleted;
 } NICState;
 
+char *qemu_mac_strdup_printf(const uint8_t *macaddr);
 NetClientState *qemu_find_netdev(const char *id);
 int qemu_find_net_clients_except(const char *id, NetClientState **ncs,
  NetClientOptionsKind type, int max);
diff --git a/net/net.c b/net/net.c
index 7acc162..2387616 100644
--- a/net/net.c
+++ b/net/net.c
@@ -151,6 +151,13 @@ int parse_host_port(struct sockaddr_in *saddr, const char 
*str)
 return 0;
 }
 
+char *qemu_mac_strdup_printf(const uint8_t *macaddr)
+{
+return g_strdup_printf(%.2x:%.2x:%.2x:%.2x:%.2x:%.2x,
+   macaddr[0], macaddr[1], macaddr[2],
+   macaddr[3], macaddr[4], macaddr[5]);
+}
+
 void qemu_format_nic_info_str(NetClientState *nc, uint8_t macaddr[6])
 {
 snprintf(nc-info_str, sizeof(nc-info_str),
-- 
1.7.10.4

[Qemu-devel] [PATCH v4 05/10] pci: add network device class 'other' for network switches

2015-01-16 Thread sfeldma

From: Scott Feldman sfel...@gmail.com

Rocker is an ethernet switch device, so add 'other' network device class as
defined by PCI to cover these types of devices.

Signed-off-by: Scott Feldman sfel...@gmail.com
Signed-off-by: Jiri Pirko j...@resnulli.us
---
 include/hw/pci/pci_ids.h |1 +
 1 file changed, 1 insertion(+)

diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
index d7be386..c6de710 100644
--- a/include/hw/pci/pci_ids.h
+++ b/include/hw/pci/pci_ids.h
@@ -23,6 +23,7 @@
 #define PCI_CLASS_STORAGE_OTHER  0x0180
 
 #define PCI_CLASS_NETWORK_ETHERNET   0x0200
+#define PCI_CLASS_NETWORK_OTHER  0x0280
 
 #define PCI_CLASS_DISPLAY_VGA0x0300
 #define PCI_CLASS_DISPLAY_OTHER  0x0380
-- 
1.7.10.4

[Qemu-devel] [PATCH v4 03/10] rocker: add register programming guide

2015-01-16 Thread sfeldma

From: Scott Feldman sfel...@gmail.com

This is the register programming guide for the Rocker device.  It's intended
for driver writers and device writers.  It covers the device's PCI space,
the register set, DMA interface, and interrupts.

Signed-off-by: Scott Feldman sfel...@gmail.com
Signed-off-by: Jiri Pirko j...@resnulli.us
---
 docs/specs/rocker.txt |  960 +
 1 file changed, 960 insertions(+)
 create mode 100644 docs/specs/rocker.txt

diff --git a/docs/specs/rocker.txt b/docs/specs/rocker.txt
new file mode 100644
index 000..fa9db19
--- /dev/null
+++ b/docs/specs/rocker.txt
@@ -0,0 +1,960 @@
+Rocker Network Switch Register Programming Guide
+Copyright (c) Scott Feldman sfel...@gmail.com
+Copyright (c) Neil Horman nhor...@tuxdriver.com
+Version 0.11, 12/29/2014
+
+LICENSE
+===
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+SECTION 1: Introduction
+===
+
+Overview
+
+
+This document describes the hardware/software interface for the Rocker switch
+device.  The intended audience is authors of OS drivers and device emulation
+software.
+
+Notations and Conventions
+-
+
+o In register descriptions, [n:m] indicates a range from bit n to bit m,
+inclusive.
+o Use of leading 0x indicates a hexadecimal number.
+o Use of leading 0b indicates a binary number.
+o The use of RSVD or Reserved indicates that a bit or field is reserved for
+future use.
+o Field width is in bytes, unless otherwise noted.
+o Register are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear
+on read
+o TLV values in network-byte-order are designated with (N).
+
+
+SECTION 2: PCI Configuration Registers
+==
+
+PCI Configuration Space
+---
+
+Each switch instance registers as a PCI device with PCI configuration space:
+
+   offset  width   description value
+   -
+   0x0 2   Vendor ID   0x1b36
+   0x2 2   Device ID   0x0006
+   0x4 4   Command/Status
+   0x8 1   Revision ID 0x01
+   0x9 3   Class code  0x2800
+   0xC 1   Cache line size
+   0xD 1   Latency timer
+   0xE 1   Header type
+   0xF 1   Built-in self test
+   0x104   Base address low
+   0x144   Base address high
+   0x18-28 Reserved
+   0x2C2   Subsystem vendor ID *
+   0x2E2   Subsystem ID*
+   0x30-38 Reserved
+   0x3C1   Interrupt line
+   0x3D1   Interrupt pin   0x00
+   0x3E1   Min grant   0x00
+   0x3D1   Max latency 0x00
+   0x401   TRDY timeout
+   0x411   Retry count
+   0x422   Reserved
+
+
+* Assigned by sub-system implementation
+
+SECTION 3: Memory-Mapped Register Space
+===
+
+There are two memory-mapped BARs.  BAR0 maps device register space and is
+0x2000 in size.  BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
+size, allowing for 256 MSI-X vectors.
+
+All registers are 4 or 8 bytes long.  It is assumed host software will access 4
+byte registers with one 4-byte access, and 8 byte registers with either two
+4-byte accesses or a single 8-byte access.  In the case of two 4-byte accesses,
+access must be lower and then upper 4-bytes, in that order.
+
+BAR0 device register space is organized as follows:
+
+   offset  description
+   --
+   0x-0x000f   Bogus registers to catch misbehaving
+   drivers.  Writes do nothing.  Reads
+   back as 0xDEADBABE.
+   0x0010-0x00ff   Test registers
+   0x0300-0x03ff   General purpose registers
+   0x1000-0x1fff   Descriptor control
+
+Holes in register space are reserved.  Writes to reserved registers do nothing.
+Reads to reserved registers read back as 0.
+
+No fancy stuff like write-combining is enabled on any of the registers.
+
+BAR1 MSI-X register space is organized as follows:
+
+   offset  description
+   --
+   0x-0x0fff   MSI-X vector table (256 vectors total)
+   0x1000-0x1fff   MSI-X PBA table
+
+
+SECTION 4: Interrupts, DMA, and Endianess

[Qemu-devel] [PATCH v4 09/10] MAINTAINERS: add rocker

2015-01-16 Thread sfeldma

From: Scott Feldman sfel...@gmail.com

Signed-off-by: Scott Feldman sfel...@gmail.com
Signed-off-by: Jiri Pirko j...@resnulli.us
---
 MAINTAINERS |6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 430688d..8b6f8d4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -736,6 +736,12 @@ S: Maintained
 F: hw/net/vmxnet*
 F: hw/scsi/vmw_pvscsi*
 
+Rocker
+M: Scott Feldman sfel...@gmail.com
+M: Jiri Pirko j...@resnulli.us
+S: Maintained
+F: hw/net/rocker/
+
 Subsystems
 --
 Audio
-- 
1.7.10.4

[Qemu-devel] [PATCH v4 10/10] rocker: timestamp on the debug logs helps correlate with events in the VM

2015-01-16 Thread sfeldma

From: David Ahern dsah...@gmail.com

Signed-off-by: David Ahern dsah...@gmail.com
Signed-off-by: Scott Feldman sfel...@gmail.com
---
 hw/net/rocker/rocker.h |   11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/net/rocker/rocker.h b/hw/net/rocker/rocker.h
index ef77487..5ae8aff 100644
--- a/hw/net/rocker/rocker.h
+++ b/hw/net/rocker/rocker.h
@@ -25,7 +25,16 @@
 
 #if defined(DEBUG_ROCKER)
 #  define DPRINTF(fmt, ...) \
-do { fprintf(stderr, ROCKER:  fmt, ## __VA_ARGS__); } while (0)
+do {   \
+struct timeval tv; \
+char timestr[64];  \
+time_t now;\
+gettimeofday(tv, NULL);   \
+now = tv.tv_sec;   \
+strftime(timestr, sizeof(timestr), %T, localtime(now)); \
+fprintf(stderr, %s.%06ld , timestr, tv.tv_usec); \
+fprintf(stderr, ROCKER:  fmt, ## __VA_ARGS__);   \
+} while (0)
 #else
 static inline GCC_FMT_ATTR(1, 2) int DPRINTF(const char *fmt, ...)
 {
-- 
1.7.10.4

Re: [Qemu-devel] global_mutex and multithread.

2015-01-16 Thread Mark Burton


 On 16 Jan 2015, at 09:07, Jan Kiszka jan.kis...@siemens.com wrote:
 
 On 2015-01-16 08:25, Mark Burton wrote:
 
 On 15 Jan 2015, at 22:41, Paolo Bonzini pbonz...@redhat.com wrote:
 
 
 
 On 15/01/2015 21:53, Mark Burton wrote:
 Jan said he had it working at least on ARM (MusicPal).
 
 yeah - our problem is when we enable multi-threads - which I dont believe 
 Jan did…
 
 Multithreaded TCG, or single-threaded TCG with SMP?
 
 He mentions SMP, - I assume thats single-threaded ….
 
 Yes, I didn't patched anything towards multi-threaded SMP. Main reason:
 there was no answer on how to emulated the memory models of that target
 architecture over the host one which is mandatory if you let the
 emulated CPUs run unsynchronized in parallel. Did this change?
 

No - we just decided to stop pressing the button…
I think this is the ‘x86 on ARM’ issue ? - our plan is to get ARM of x86 
working (or that class), and the worry about the other way round.

I dont see why SMP ARM on X86 wouldn’t work with your patch?

Cheers

Mark.

 
 
 One thing I wonder - why do we need to go to the extent of mutexing
 in the TCG like this? Why can’t you simply put a mutex get/release on
 the slow path? If the core is going to do ‘fast path’ access to the
 memory, - even if that memory was IO mapped - would it matter if it
 didn’t have the mutex?
 
 Because there is no guarantee that the memory map isn't changed by a
 core under the feet of another.  The TLB (in particular the iotlb) is
 only valid with reference to a particular memory map.
 
 
 Changes to the memory map certainly happen in the slow path, but lookups
 are part of the fast path.  Even an rwlocks is too slow for a fast path,
 hence the plan of going with RCU.
 
 Could we arrange the world such that lookups ‘succeed’ (the wheels
 dont fall off) -ether getting the old value, or the new, but not getting
 rubbish - and we still only take the mutex if we are going to make
 alterations to the MM itself? (I have’t looked at the code around that…
 so sorry if the question is ridiculous).
 
 That's the definition of RCU. :)  Look at the docs in
 http://permalink.gmane.org/gmane.comp.emulators.qemu/313929 for more
 information. :)
 
 Ahh - I see !
 
 
 It's still not trivial to make it 100% correct, but at the same time
 it's not too hard to prepare something decent to play with.  Also, most
 of the work can be done with KVM so it's more or less independent from
 what you guys have been doing so far.
 
 Yes - the issue is if we end up relying on it.
 But - I see what you mean - these 2 things can ‘dovetail’ together 
 “independently” - so - Jan’s patch will be good for now, and then later we 
 can use RCU to make it work more generally (and more efficiently).
 
 So - our only small problem is getting Jan’s patch to work for multi-thread 
 :-))
 
 See above regarding the potential dimension.
 
 Jan
 
 -- 
 Siemens AG, Corporate Technology, CT RTC ITP SES-DE
 Corporate Competence Center Embedded Linux


 +44 (0)20 7100 3485 x 210
+33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton

Re: [Qemu-devel] [PATCH 1/4] configure: opengl overhaul

2015-01-16 Thread Gerd Hoffmann

  Hi,

  +# opengl probe, used by milkymist-tmu2
 
 Maybe remove this part of the comment about milkymist?

Will do.

  +if test $opengl != no ; then
  +  opengl_pkgs=gl
  +  if $pkg_config $opengl_pkgs; then
  +opengl_libs=$($pkg_config --libs $opengl_pkgs) -lX11
  +opengl=yes
 
 `pkg-config gl` might be successful even without X11 available. In this 
 case, $opengl will be set to 'yes', but linking will probably fail.

Indeed.

 Maybe we should keep GLX separated from OpenGL; for SDL we don't need 
 X11, but for milkymist-tmu2 we apparently do.

That should be only temporary.  There will be infrastructure for device
emulations to ask the UI for a opengl context.  Once this is in place we
should be able to switch over milkymist-tmu2 to that so it doesn't has
to initialize opengl on its own.

cheers,
  Gerd

Re: [Qemu-devel] [virtio] virtqueue allocation and thread-safety

2015-01-16 Thread Vasile Catalin-B50542


Does the guest memory allocation include the descriptor tables allocation?
Which part of the structures is the descriptor tables?

On 15.01.2015 03:41, Fam Zheng wrote:

On Wed, 01/14 14:01, Vasile Catalin-B50542 wrote:

Hi,

I'm trying to make a new virtio device.
I got it running (I made a functional dummy device  guest driver).
Now I'm trying to build some communication between the device and guest
driver.
I can't seem to find where the actual allocation of virtqueues are made.
I've looked inside virtio_init(), but it just allocates the vq structures
and I don't see
any pointers inside that structure that might give and idea of something
dynamically
allocated. There is a member of that structure named vector, but it's type
is uint16_t.
I've also looked inside the virtio_add_queue(), and it just makes some
constant assignments.
Where are the vqs buffer space actually allocated?

The guest memory is prepared by guest. See the virtio spec Virtual I/O Device
(VIRTIO) Version 1.0 section 3.2.1 Supplying Buffers to The Device:

quote

The driver offers buffers to one of the device’s virtqueues as follows:

1. The driver places the buffer into free descriptor(s) in the descriptor
table, chaining as necessary (see 2.4.5 The Virtqueue Descriptor Table).

2. The driver places the index of the head of the descriptor chain into the
next ring entry of the available ring.

3. Steps 1 and 2 MAY be performed repeatedly if batching is possible.

4. The driver performs suitable a memory barrier to ensure the device sees the
updated descriptor table and available ring before the next step.

5. The available idx is increased by the number of descriptor chain heads added
to the available ring.

6. The driver performs a suitable memory barrier to ensure that it updates the
idx field before checking for notification suppression.

7. If notifications are not suppressed, the driver notifies the device of the
new available buffers.

/unquote

Fam


One more thing. Are the virtqueue functions thread safe, from both point of
views
(qemu and guest driver API's view)?
Also I don't see any dynamic allocations/freeing when pushing and popping,
either.

Cătă



--

CatalinVasile

Intern, DN-Software

FreescaleSemiconductor, Inc.

www.freescale.com

phone: 073-021-1938

e-mail: catalin.vas...@freescale.com mailto:your.n...@freescale.com

*Freescale_Logo-nosemi_Lh_4c***

This e-mail, and any associated attachments have been classified as:

[ ] Public

[ ] Freescale Semiconductor Internal Use Only

[ ] Freescale Semiconductor Confidential Proprietary

Re: [Qemu-devel] [PATCH v3 6/9] rocker: add new rocker switch device

2015-01-16 Thread Jason Wang


On 01/11/2015 11:57 AM, sfel...@gmail.com wrote:
 From: Scott Feldman sfel...@gmail.com

 Rocker is a simulated ethernet switch device.  The device supports up to 62
 front-panel ports and supports L2 switching and L3 routing functions, as well
 as L2/L3/L4 ACLs.  The device presents a single PCI device for each switch,
 with a memory-mapped register space for device driver access.

 Rocker device is invoked with -device, for example a 4-port switch:

   -device rocker,name=sw1,len-ports=4,ports[0]=dev0,ports[1]=dev1, \
  ports[2]=dev2,ports[3]=dev3

 Each port is a netdev and can be paired with using -netdev id=port name.

 Signed-off-by: Scott Feldman sfel...@gmail.com
 Signed-off-by: Jiri Pirko j...@resnulli.us
 ---

Looks like the devices does not support state saving. How about tagging
it with unmigratable first?
  default-configs/pci.mak   |1 +
  hw/net/Makefile.objs  |3 +
  hw/net/rocker/rocker.c| 1395 +
  hw/net/rocker/rocker.h|   76 ++
  hw/net/rocker/rocker_desc.c   |  379 +++
  hw/net/rocker/rocker_desc.h   |   57 +
  hw/net/rocker/rocker_fp.c |  232 +
  hw/net/rocker/rocker_fp.h |   53 +
  hw/net/rocker/rocker_hw.h |  475 +
  hw/net/rocker/rocker_of_dpa.c | 2314 
 +
  hw/net/rocker/rocker_of_dpa.h |   25 +
  hw/net/rocker/rocker_tlv.h|  244 +
  hw/net/rocker/rocker_world.c  |  108 ++
  hw/net/rocker/rocker_world.h  |   63 ++
  14 files changed, 5425 insertions(+)
  create mode 100644 hw/net/rocker/rocker.c
  create mode 100644 hw/net/rocker/rocker.h
  create mode 100644 hw/net/rocker/rocker_desc.c
  create mode 100644 hw/net/rocker/rocker_desc.h
  create mode 100644 hw/net/rocker/rocker_fp.c
  create mode 100644 hw/net/rocker/rocker_fp.h
  create mode 100644 hw/net/rocker/rocker_hw.h
  create mode 100644 hw/net/rocker/rocker_of_dpa.c
  create mode 100644 hw/net/rocker/rocker_of_dpa.h
  create mode 100644 hw/net/rocker/rocker_tlv.h
  create mode 100644 hw/net/rocker/rocker_world.c
  create mode 100644 hw/net/rocker/rocker_world.h
[...]
 +
 +typedef struct rocker {
 +/* private */
 +PCIDevice parent_obj;
 +/* public */
 +
 +MemoryRegion mmio;
 +MemoryRegion msix_bar;
 +
 +/* switch configuration */
 +char *name;  /* switch name */
 +uint32_t fp_ports;   /* front-panel port count */
 +NICPeers *fp_ports_peers;
 +MACAddr fp_start_macaddr;/* front-panel port 0 mac addr */
 +uint64_t switch_id;  /* switch id */

Looks like the function of name ad switch_id are duplicated?
 +
 +/* front-panel ports */
 +FpPort *fp_port[ROCKER_FP_PORTS_MAX];
 +
 +/* register backings */
 +uint32_t test_reg;
 +uint64_t test_reg64;
 +dma_addr_t test_dma_addr;
 +uint32_t test_dma_size;
 +
 +/* desc rings */
 +DescRing **rings;
 +
 +/* switch worlds */
 +World *worlds[ROCKER_WORLD_TYPE_MAX];
 +World *world_dflt;
 +
 +QLIST_ENTRY(rocker) next;
 +} Rocker;
 +
[...]
 +
 +static int tx_consume(Rocker *r, DescInfo *info)
 +{
 +PCIDevice *dev = PCI_DEVICE(r);
 +char *buf = desc_get_buf(info, true);
 +RockerTlv *tlv_frag;
 +RockerTlv *tlvs[ROCKER_TLV_TX_MAX + 1];
 +struct iovec iov[ROCKER_TX_FRAGS_MAX] = { { 0, }, };
 +uint32_t pport;
 +uint32_t port;
 +uint16_t tx_offload = ROCKER_TX_OFFLOAD_NONE;
 +uint16_t tx_l3_csum_off = 0;
 +uint16_t tx_tso_mss = 0;
 +uint16_t tx_tso_hdr_len = 0;
 +int iovcnt = 0;
 +int err = 0;
 +int rem;
 +int i;
 +
 +if (!buf) {
 +return -ENXIO;
 +}
 +
 +rocker_tlv_parse(tlvs, ROCKER_TLV_TX_MAX, buf, desc_tlv_size(info));
 +
 +if (!tlvs[ROCKER_TLV_TX_FRAGS]) {
 +return -EINVAL;
 +}
 +
 +pport = rocker_get_pport_by_tx_ring(r, desc_get_ring(info));
 +if (!fp_port_from_pport(pport, port)) {
 +return -EINVAL;
 +}
 +
 +if (tlvs[ROCKER_TLV_TX_OFFLOAD]) {
 +tx_offload = rocker_tlv_get_u8(tlvs[ROCKER_TLV_TX_OFFLOAD]);
 +}
 +
 +switch (tx_offload) {
 +case ROCKER_TX_OFFLOAD_L3_CSUM:
 +if (!tlvs[ROCKER_TLV_TX_L3_CSUM_OFF]) {
 +return -EINVAL;
 +}
 +case ROCKER_TX_OFFLOAD_TSO:
 +if (!tlvs[ROCKER_TLV_TX_TSO_MSS] ||
 +!tlvs[ROCKER_TLV_TX_TSO_HDR_LEN]) {
 +return -EINVAL;
 +}
 +}
 +
 +if (tlvs[ROCKER_TLV_TX_L3_CSUM_OFF]) {
 +tx_l3_csum_off = 
 rocker_tlv_get_le16(tlvs[ROCKER_TLV_TX_L3_CSUM_OFF]);
 +}
 +
 +if (tlvs[ROCKER_TLV_TX_TSO_MSS]) {
 +tx_tso_mss = rocker_tlv_get_le16(tlvs[ROCKER_TLV_TX_TSO_MSS]);
 +}
 +
 +if (tlvs[ROCKER_TLV_TX_TSO_HDR_LEN]) {
 +tx_tso_hdr_len = 
 rocker_tlv_get_le16(tlvs[ROCKER_TLV_TX_TSO_HDR_LEN]);
 +}
 +
 +rocker_tlv_for_each_nested(tlv_frag, tlvs[ROCKER_TLV_TX_FRAGS], rem) {
 +hwaddr frag_addr;
 +uint16_t frag_len;
 +
 +if

Re: [Qemu-devel] [PATCH] s390x/pci: fix 2 bugs found by coverity

2015-01-16 Thread Cornelia Huck

On Wed, 14 Jan 2015 16:20:47 +0100
Frank Blaschka blasc...@linux.vnet.ibm.com wrote:

 Signed-off-by: Frank Blaschka blasc...@linux.vnet.ibm.com
 ---
  hw/s390x/s390-pci-bus.c  | 1 +
  hw/s390x/s390-pci-inst.c | 2 +-
  2 files changed, 2 insertions(+), 1 deletion(-)

Applied with a slightly tweaked description:

s390x/pci: fix two bugs found by coverity

Fix a resource leak and a store with the wrong length.

Reviewed-by: Thomas Huth th...@linux.vnet.ibm.com
Signed-off-by: Frank Blaschka blasc...@linux.vnet.ibm.com
Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com

Re: [Qemu-devel] [PATCH 2/4] console: add opengl rendering helper functions

2015-01-16 Thread Gerd Hoffmann

  Hi,

  +bool console_gl_check_format(DisplayChangeListener *dcl,
  + pixman_format_code_t format)
  +{
  +switch (format) {
  +case PIXMAN_x8r8g8b8:
  +case PIXMAN_a8r8g8b8:
  +case PIXMAN_r5g6b5:
  +return true;
  +default:
  +return false;
  +}
  +}
 
 What is this function supposed to be used for?

Inflight change, there is a patch series on the list adding a
check_format callback to DisplayChangeListenerOps, for format
negotiation.

  +
  +void surface_gl_create_texture(DisplaySurface *surface)
  +{
  +switch (surface-format) {
  +case PIXMAN_x8r8g8b8:
  +case PIXMAN_a8r8g8b8:
  +surface-glformat = GL_BGRA;
 
 Why does the format code seem to imply ARGB order but you're using the 
 exact opposite? I could imagine something like endianness being the 
 reason, but you're using RGB below where the format code says exactly that.

endianness indeed.  pixman formats are native endian, so this actually
is bgra ordering in memory.  While looking at it:  I guess GL_BGRA is
fixed byte ordering?  So this probably needs fixing to work correctly on
bigendian machines ...

 Discarding the alpha channel is intentional, I suppose?

Yes.

 
  + surface_width(surface),
  + surface_height(surface),
  + 0, surface-glformat, surface-gltype,
  + surface_data(surface));
 
 Is surface_stride(surface) specified to be surface_width(surface) * 
 bytes_per_pixel (I don't know pixman so I don't know)?

Usually this is the case, but there can be exceptions.  Hmm, can I
explicitly pass the stride?

thanks,
  Gerd

Re: [Qemu-devel] [PATCH v3 7/9] qmp: add rocker device support

2015-01-16 Thread Jason Wang


On 01/11/2015 11:57 AM, sfel...@gmail.com wrote:
 From: Scott Feldman sfel...@gmail.com

 Add QMP/HMP support for rocker devices.  This is mostly for debugging purposes
 to see inside the device's tables and port configurations.  Some examples:

 (qemu) info rocker sw1
 name: sw1
 id: 0x013512005452
 ports: 4

 (qemu) info rocker-ports sw1
 ena/speed/ auto
   port  linkduplex neg?
  sw1.1  up 10G  FD  No
  sw1.2  up 10G  FD  No
  sw1.3  !ena   10G  FD  No
  sw1.4  !ena   10G  FD  No

 (qemu) info rocker-of-dpa-flows sw1
 prio tbl hits key(mask) -- actions
 260   lport 1 vlan 1 LLDP src 00:02:00:00:02:00 dst 01:80:c2:00:00:0e
 260   lport 1 vlan 1 ARP src 00:02:00:00:02:00 dst 00:02:00:00:03:00
 260   lport 2 vlan 2 IPv6 src 00:02:00:00:03:00 dst 33:33:ff:00:00:02 
 proto 58
 350   vlan 2 dst 33:33:ff:00:00:02 -- write group 0x3201 goto 
 tbl 60
 260   lport 2 vlan 2 IPv6 src 00:02:00:00:03:00 dst 33:33:ff:00:03:00 
 proto 58
 350  1vlan 2 dst 33:33:ff:00:03:00 -- write group 0x3201 goto 
 tbl 60
 260   lport 2 vlan 2 ARP src 00:02:00:00:03:00 dst 00:02:00:00:02:00
 350  2vlan 2 dst 00:02:00:00:02:00 -- write group 0x0201 goto 
 tbl 60
 260  1lport 2 vlan 2 IP src 00:02:00:00:03:00 dst 00:02:00:00:02:00 
 proto 1
 350  2vlan 1 dst 00:02:00:00:03:00 -- write group 0x0102 goto 
 tbl 60
 260  1lport 1 vlan 1 IP src 00:02:00:00:02:00 dst 00:02:00:00:03:00 
 proto 1
 260   lport 1 vlan 1 IPv6 src 00:02:00:00:02:00 dst 33:33:ff:00:00:01 
 proto 58
 350   vlan 1 dst 33:33:ff:00:00:01 -- write group 0x3100 goto 
 tbl 60
 260   lport 1 vlan 1 IPv6 src 00:02:00:00:02:00 dst 33:33:ff:00:02:00 
 proto 58
 350  1vlan 1 dst 33:33:ff:00:02:00 -- write group 0x3100 goto 
 tbl 60
 160  173  lport 2 vlan 2 LLDP src any dst 01:80:c2:00:00:0e -- write 
 group 0x0200
 160  6lport 2 vlan 2 IPv6 src any dst any -- write group 
 0x0200
 160  174  lport 1 vlan 1 LLDP src any dst 01:80:c2:00:00:0e -- write 
 group 0x0100
 160  174  lport 2 vlan 2 IP src any dst any -- write group 0x0200
 160  6lport 1 vlan 1 IPv6 src any dst any -- write group 
 0x0100
 160  181  lport 2 vlan 2 ARP src any dst any -- write group 
 0x0200
 110  715  lport 2 -- apply new vlan 2 goto tbl 20
 160  177  lport 1 vlan 1 ARP src any dst any -- write group 
 0x0100
 160  174  lport 1 vlan 1 IP src any dst any -- write group 0x0100
 110  717  lport 1 -- apply new vlan 1 goto tbl 20
 10   1432 lport 0(0x) -- goto tbl 10

 (qemu) info rocker-of-dpa-groups sw1
 id (decode) -- buckets
 0x3201 (type L2 multicast vlan 2 index 1) -- groups 
 [0x0201,0x0200]
 0x0201 (type L2 interface vlan 2 lport 1) -- pop vlan out lport 1
 0x0102 (type L2 interface vlan 1 lport 2) -- pop vlan out lport 2
 0x0200 (type L2 interface vlan 2 lport 0) -- pop vlan out lport 0
 0x0100 (type L2 interface vlan 1 lport 0) -- pop vlan out lport 0
 0x3100 (type L2 multicast vlan 1 index 0) -- groups 
 [0x0102,0x0100]

 Signed-off-by: Scott Feldman sfel...@gmail.com
 Signed-off-by: Jiri Pirko j...@resnulli.us
 ---
  hmp-commands.hx   |   24 
  hmp.c |  303 
  hmp.h |4 +
  hw/net/rocker/rocker.c|   45 ++
  hw/net/rocker/rocker_fp.c |   10 ++
  hw/net/rocker/rocker_fp.h |1 +
  hw/net/rocker/rocker_of_dpa.c |  308 
 +
  monitor.c |   28 
  qapi-schema.json  |3 +
  qapi/rocker.json  |  259 ++
  qmp-commands.hx   |   97 +
  11 files changed, 1082 insertions(+)
  create mode 100644 qapi/rocker.json

I don't object this method but not sure it was worth (especially
consider it needs nearly a thousand lines of codes).

Could we just do this through remote controller or agent?

Thanks

Re: [Qemu-devel] [PATCH v3 6/9] rocker: add new rocker switch device

2015-01-16 Thread Scott Feldman

On Fri, Jan 16, 2015 at 1:15 AM, Jason Wang jasow...@redhat.com wrote:

 On 01/11/2015 11:57 AM, sfel...@gmail.com wrote:
 Each port is a netdev and can be paired with using -netdev id=port name.

 Signed-off-by: Scott Feldman sfel...@gmail.com
 Signed-off-by: Jiri Pirko j...@resnulli.us
 ---

 Looks like the devices does not support state saving. How about tagging
 it with unmigratable first?

State saving would be nice to add in the future.  How do we tag is
unmigratable for now?

 +uint64_t switch_id;  /* switch id */

 Looks like the function of name ad switch_id are duplicated?

Don't follow...

 +if (iovcnt) {
 +/* XXX perform Tx offloads */
 +/* XXX   silence compiler for now */

 Need add TSO here (e1000 is a good reference) or disable TSO in driver.

TSO is not enabled in driver, currently.  This is lower priority now
since the I/O path for traffic to/from the guest CPU is not
performance critical.  The forwarding path offloaded by the device is
the hot path, but that doesn't involve I/O to guest CPU, so Tx/Rx
offloads aren't in play.

But, someday, it would be nice to enable Rx/Tx offloads.

 +rocker_tlv_put_le32(buf, pos, ROCKER_TLV_EVENT_TYPE,
 +ROCKER_TLV_EVENT_TYPE_LINK_CHANGED);
 +nest = rocker_tlv_nest_start(buf, pos, ROCKER_TLV_EVENT_INFO);
 +rocker_tlv_put_le32(buf, pos, ROCKER_TLV_EVENT_LINK_CHANGED_PPORT, 
 pport);
 +rocker_tlv_put_u8(buf, pos, ROCKER_TLV_EVENT_LINK_CHANGED_LINKUP,
 +  link_up ? 1 : 0);

 Looks like those types are not documented.

Ok, I'll double check spec to make sure we're covered, for the items you found.

 +rocker_tlv_put_le16(buf, pos, ROCKER_TLV_RX_FLAGS, rx_flags);
 +rocker_tlv_put_le16(buf, pos, ROCKER_TLV_RX_CSUM, rx_csum);

 Note: Some backend (e.g tap) can do offloading, may consider to add the
 support in the future.

Yup, on TODO list for Rx/Tx offloads.

 Using only 1 queues is ok for multiqueue backend. In the future may
 consider to use all.

When we get to QoS for working, I think we'll want to enable these
multiqueues.  QoS support is not implemented in rocker, currently, but
there is support defined in the OF-DPA spec as part of ACL processing.
So this is future work.

 Thanks

Thanks for reviewing.

Re: [Qemu-devel] [PATCH v3 7/9] qmp: add rocker device support

2015-01-16 Thread Scott Feldman

On Fri, Jan 16, 2015 at 1:26 AM, Jason Wang jasow...@redhat.com wrote:

 On 01/11/2015 11:57 AM, sfel...@gmail.com wrote:
 From: Scott Feldman sfel...@gmail.com

 Add QMP/HMP support for rocker devices.  This is mostly for debugging 
 purposes
 to see inside the device's tables and port configurations.  Some examples:

 (qemu) info rocker sw1
 name: sw1
 id: 0x013512005452
 ports: 4

 (qemu) info rocker-ports sw1
 ena/speed/ auto
   port  linkduplex neg?
  sw1.1  up 10G  FD  No
  sw1.2  up 10G  FD  No
  sw1.3  !ena   10G  FD  No
  sw1.4  !ena   10G  FD  No

 (qemu) info rocker-of-dpa-flows sw1
 prio tbl hits key(mask) -- actions
 260   lport 1 vlan 1 LLDP src 00:02:00:00:02:00 dst 01:80:c2:00:00:0e
 260   lport 1 vlan 1 ARP src 00:02:00:00:02:00 dst 00:02:00:00:03:00
 260   lport 2 vlan 2 IPv6 src 00:02:00:00:03:00 dst 
 33:33:ff:00:00:02 proto 58
 350   vlan 2 dst 33:33:ff:00:00:02 -- write group 0x3201 goto 
 tbl 60
 260   lport 2 vlan 2 IPv6 src 00:02:00:00:03:00 dst 
 33:33:ff:00:03:00 proto 58
 350  1vlan 2 dst 33:33:ff:00:03:00 -- write group 0x3201 goto 
 tbl 60
 260   lport 2 vlan 2 ARP src 00:02:00:00:03:00 dst 00:02:00:00:02:00
 350  2vlan 2 dst 00:02:00:00:02:00 -- write group 0x0201 goto 
 tbl 60
 260  1lport 2 vlan 2 IP src 00:02:00:00:03:00 dst 00:02:00:00:02:00 
 proto 1
 350  2vlan 1 dst 00:02:00:00:03:00 -- write group 0x0102 goto 
 tbl 60
 260  1lport 1 vlan 1 IP src 00:02:00:00:02:00 dst 00:02:00:00:03:00 
 proto 1
 260   lport 1 vlan 1 IPv6 src 00:02:00:00:02:00 dst 
 33:33:ff:00:00:01 proto 58
 350   vlan 1 dst 33:33:ff:00:00:01 -- write group 0x3100 goto 
 tbl 60
 260   lport 1 vlan 1 IPv6 src 00:02:00:00:02:00 dst 
 33:33:ff:00:02:00 proto 58
 350  1vlan 1 dst 33:33:ff:00:02:00 -- write group 0x3100 goto 
 tbl 60
 160  173  lport 2 vlan 2 LLDP src any dst 01:80:c2:00:00:0e -- write 
 group 0x0200
 160  6lport 2 vlan 2 IPv6 src any dst any -- write group 
 0x0200
 160  174  lport 1 vlan 1 LLDP src any dst 01:80:c2:00:00:0e -- write 
 group 0x0100
 160  174  lport 2 vlan 2 IP src any dst any -- write group 
 0x0200
 160  6lport 1 vlan 1 IPv6 src any dst any -- write group 
 0x0100
 160  181  lport 2 vlan 2 ARP src any dst any -- write group 
 0x0200
 110  715  lport 2 -- apply new vlan 2 goto tbl 20
 160  177  lport 1 vlan 1 ARP src any dst any -- write group 
 0x0100
 160  174  lport 1 vlan 1 IP src any dst any -- write group 
 0x0100
 110  717  lport 1 -- apply new vlan 1 goto tbl 20
 10   1432 lport 0(0x) -- goto tbl 10

 (qemu) info rocker-of-dpa-groups sw1
 id (decode) -- buckets
 0x3201 (type L2 multicast vlan 2 index 1) -- groups 
 [0x0201,0x0200]
 0x0201 (type L2 interface vlan 2 lport 1) -- pop vlan out lport 1
 0x0102 (type L2 interface vlan 1 lport 2) -- pop vlan out lport 2
 0x0200 (type L2 interface vlan 2 lport 0) -- pop vlan out lport 0
 0x0100 (type L2 interface vlan 1 lport 0) -- pop vlan out lport 0
 0x3100 (type L2 multicast vlan 1 index 0) -- groups 
 [0x0102,0x0100]

 Signed-off-by: Scott Feldman sfel...@gmail.com
 Signed-off-by: Jiri Pirko j...@resnulli.us
 ---
  hmp-commands.hx   |   24 
  hmp.c |  303 
 
  hmp.h |4 +
  hw/net/rocker/rocker.c|   45 ++
  hw/net/rocker/rocker_fp.c |   10 ++
  hw/net/rocker/rocker_fp.h |1 +
  hw/net/rocker/rocker_of_dpa.c |  308 
 +
  monitor.c |   28 
  qapi-schema.json  |3 +
  qapi/rocker.json  |  259 ++
  qmp-commands.hx   |   97 +
  11 files changed, 1082 insertions(+)
  create mode 100644 qapi/rocker.json

 I don't object this method but not sure it was worth (especially
 consider it needs nearly a thousand lines of codes).

I wasn't happy about touching core files to add rocker support.
Splitting device-specific stuff out into qapi/rocker.json helped.
Seems qmp/hmp aren't really setup for devices to plug-in their stuff
in in a modular way.  I couldn't find other examples of devices using
qmp/hmp to the extent rocker is, but maybe I didn't look hard enough.

However, QMP/HMP support is very handy for debugging rocker device
since there is tons of state stored in the switch forwarding tables,
and it can be quite dynamic depending on what's going on on the guest
OS w.r.t networking.

 Could we just do this through remote controller or agent?

I'm open to ideas.  qmp/hmp work great for getting an instantaneous
snapshot of the device state.  The JSON format goodness come for free
with qmp, so that's nice.

-scott

Re: [Qemu-devel] kvmclock, Migration, and NTP clock jitter

2015-01-16 Thread Mohammed Gamal

On Thu, Jan 15, 2015 at 06:27:54PM +0100, Paolo Bonzini wrote:
 
 
 On 15/01/2015 17:39, Mohammed Gamal wrote:
  The increase in the jitter and offset values is well within the 500 ppm
  frequency tolerance limit, and therefore are easily corrected by
  subsequent NTP clock sync events, but some live migrations do cause much
  higher jitter and offset jumps, which can not be corrected by NTP and
  cause the time to go way off. Any idea why this is the case?
 
 It might be fixed in QEMU 2.2.
 
 See https://lists.gnu.org/archive/html/qemu-devel/2014-09/msg01239.html
 
 Paolo

Hi Paolo,

I did try to backport these patches to qemu 1.2. However, migrations 
resulted in *higher* jitter and offset values (i.e. in the order of 100+ ppm).
I am not sure if I've done the backporting correctly though. Here are my
patches on top of the qemu 1.2 stable tree.
diff --git a/cpus.c b/cpus.c
index 29aced5..e079ee5 100644
--- a/cpus.c
+++ b/cpus.c
@@ -187,6 +187,15 @@ void cpu_disable_ticks(void)
 }
 }
 
+void cpu_clean_all_dirty(void)
+{
+CPUArchState *cpu;
+
+for (cpu = first_cpu; cpu; cpu = cpu-next_cpu) {
+cpu_clean_state(cpu);
+}
+}
+
 /* Correlation between real and virtual time is always going to be
fairly approximate, so ignore small variation.
When the guest is idle real and virtual time will be aligned in
diff --git a/cpus.h b/cpus.h
index 3fc1a4a..1ff166b 100644
--- a/cpus.h
+++ b/cpus.h
@@ -12,6 +12,7 @@ void unplug_vcpu(void *p);
 void cpu_synchronize_all_states(void);
 void cpu_synchronize_all_post_reset(void);
 void cpu_synchronize_all_post_init(void);
+void cpu_clean_all_dirty(void);
 
 void qtest_clock_warp(int64_t dest);
 
diff --git a/hw/kvm/clock.c b/hw/kvm/clock.c
index 824b978..b2bdda4 100644
--- a/hw/kvm/clock.c
+++ b/hw/kvm/clock.c
@@ -16,6 +16,8 @@
 #include qemu-common.h
 #include sysemu.h
 #include kvm.h
+#include host-utils.h
+#include cpus.h
 #include hw/sysbus.h
 #include hw/kvm/clock.h
 
@@ -28,6 +30,46 @@ typedef struct KVMClockState {
 bool clock_valid;
 } KVMClockState;
 
+struct pvclock_vcpu_time_info {
+uint32_t   version;
+uint32_t   pad0;
+uint64_t   tsc_timestamp;
+uint64_t   system_time;
+uint32_t   tsc_to_system_mul;
+int8_t tsc_shift;
+uint8_tflags;
+uint8_tpad[2];
+} __attribute__((__packed__)); /* 32 bytes */
+
+static uint64_t kvmclock_current_nsec(KVMClockState *s)
+{
+CPUArchState *env = first_cpu;
+uint64_t migration_tsc = env-tsc;
+struct pvclock_vcpu_time_info time;
+uint64_t delta;
+uint64_t nsec_lo;
+uint64_t nsec_hi;
+uint64_t nsec;
+
+if (!(env-system_time_msr  1ULL)) {
+/* KVM clock not active */
+return 0;
+}
+cpu_physical_memory_read((env-system_time_msr  ~1ULL), time, sizeof(time));
+
+assert(time.tsc_timestamp = migration_tsc);
+delta = migration_tsc - time.tsc_timestamp;
+if (time.tsc_shift  0) {
+delta = -time.tsc_shift;
+} else {
+delta = time.tsc_shift;
+}
+
+mulu64(nsec_lo, nsec_hi, delta, time.tsc_to_system_mul);
+nsec = (nsec_lo  32) | (nsec_hi  32);
+return nsec + time.system_time;
+}
+
 static void kvmclock_pre_save(void *opaque)
 {
 KVMClockState *s = opaque;
@@ -37,6 +79,23 @@ static void kvmclock_pre_save(void *opaque)
 if (s-clock_valid) {
 return;
 }
+
+cpu_synchronize_all_states();
+/* In theory, the cpu_synchronize_all_states() call above wouldn't
+ * affect the rest of the code, as the VCPU state inside CPUArchState
+ * is supposed to always match the VCPU state on the kernel side.
+ *
+ * In practice, calling cpu_synchronize_state() too soon will load the
+ * kernel-side APIC state into X86CPU.apic_state too early, APIC state
+ * won't be reloaded later because CPUState.vcpu_dirty==true, and
+ * outdated APIC state may be migrated to another host.
+ *
+ * The real fix would be to make sure outdated APIC state is read
+ * from the kernel again when necessary. While this is not fixed, we
+ * need the cpu_clean_all_dirty() call below.
+ */
+cpu_clean_all_dirty();
+
 ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, data);
 if (ret  0) {
 fprintf(stderr, KVM_GET_CLOCK failed: %s\n, strerror(ret));
@@ -55,6 +114,12 @@ static int kvmclock_post_load(void *opaque, int version_id)
 {
 KVMClockState *s = opaque;
 struct kvm_clock_data data;
+uint64_t time_at_migration = kvmclock_current_nsec(s);
+
+/* We can't rely on the migrated clock value, just discard it */
+if (time_at_migration) {
+s-clock = time_at_migration;
+}
 
 data.clock = s-clock;
 data.flags = 0;
diff --git a/kvm-all.c b/kvm-all.c
index cd2ccbe..692944e 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1547,6 +1547,11 @@ void kvm_cpu_synchronize_post_init(CPUArchState *env)
 env-kvm_vcpu_dirty = 0;
 }
 
+void kvm_cpu_clean_state(CPUArchState *env)
+{
+env-kvm_vcpu_dirty = false;
+}
+

Re: [Qemu-devel] [PATCH 3/8] rcu: add rcutorture

2015-01-16 Thread Paolo Bonzini



On 16/01/2015 03:04, Fam Zheng wrote:
 On Tue, 01/13 18:52, Paolo Bonzini wrote:
 +int main(int argc, char *argv[])
 +{
 +int nreaders = 1;
 +int duration = 1;
 +
 +if (argc = 2  argv[1][0] == '-') {
 +g_test_init(argc, argv, NULL);
 +g_test_add_func(/rcu/torture/short/1reader, gtest_stress_1_1);
 +g_test_add_func(/rcu/torture/short/10readers, gtest_stress_10_1);
 +g_test_add_func(/rcu/torture/long/1reader, gtest_stress_1_5);
 +g_test_add_func(/rcu/torture/long/10readers, gtest_stress_10_5);
 
 Why do we need short tests when we have long tests? What are different other
 than the durations?

Nothing really.  I guess long tests could be made optional (-m slow).

 +return g_test_run();
 +}
 +
 +if (argc = 2) {
 +nreaders = strtoul(argv[1], NULL, 0);
 +}
 +if (argc  3) {
 +duration = strtoul(argv[3], NULL, 0);
 +}
 +if (argc  3 || strcmp(argv[2], stress) == 0) {
 +stresstest(nreaders, duration);
 +} else if (strcmp(argv[2], rperf) == 0) {
 +rperftest(nreaders, duration);
 +} else if (strcmp(argv[2], uperf) == 0) {
 +uperftest(nreaders, duration);
 +} else if (strcmp(argv[2], perf) == 0) {
 +perftest(nreaders, duration);
 +}
 +usage(argc, argv);
 +return 0;
 +}
 diff --git a/util/rcu.c b/util/rcu.c
 index f8f52ae..8df16d9 100644
 --- a/util/rcu.c
 +++ b/util/rcu.c
 @@ -169,5 +169,4 @@ static void __attribute__((__constructor__)) 
 rcu_init(void)
  {
  qemu_mutex_init(rcu_gp_lock);
  qemu_event_init(rcu_gp_event, true);
 -rcu_register_thread();
 
 Maybe don't add this line in patch 2?

More like don't remove it in patch 3. ;)

Paolo

Re: [Qemu-devel] [PULL] migration patches

2015-01-16 Thread Peter Maydell

On 16 January 2015 at 07:47, Amit Shah amit.s...@redhat.com wrote:
 The following changes since commit df58887b20fab8fe8a6dcca4db30cd4e4077d53a:

   Merge remote-tracking branch 
 'remotes/mjt/tags/pull-trivial-patches-2015-01-15' into staging (2015-01-15 
 10:08:46 +)

 are available in the git repository at:

   git://git.kernel.org/pub/scm/virt/qemu/amit/migration.git tags/mig-2.3-1

 for you to fetch changes up to ea987c2c21d4326bb58ee28f6888fdcf8fbda067:

   vmstate: type-check sub-arrays (2015-01-16 13:06:17 +0530)

 
 A set of patches collected over the holidays.  Mix of optimizations and
 fixes.

 

Applied, thanks.

-- PMM

[Qemu-devel] [PULL 1/2] target-arm: crypto: fix BE host support

2015-01-16 Thread Peter Maydell

From: Ard Biesheuvel ard.biesheu...@linaro.org

The crypto emulation code in target-arm/crypto_helper.c never worked
correctly on big endian hosts, due to the fact that it uses a union
of array types to convert between the native VFP register size (64
bits) and the types used in the algorithms (bytes and 32 bit words)

We cannot just swab between LE and BE when reading and writing the
registers, as the SHA code performs word additions, so instead, add
array accessors for the CRYPTO_STATE type whose LE and BE specific
implementations ensure that the correct array elements are referenced.

Signed-off-by: Ard Biesheuvel ard.biesheu...@linaro.org
Acked-by: Laszlo Ersek ler...@redhat.com
Message-id: 1420208303-24111-1-git-send-email-ard.biesheu...@linaro.org
Reviewed-by: Peter Maydell peter.mayd...@linaro.org
Signed-off-by: Peter Maydell peter.mayd...@linaro.org
---
 target-arm/crypto_helper.c | 114 +
 1 file changed, 63 insertions(+), 51 deletions(-)

diff --git a/target-arm/crypto_helper.c b/target-arm/crypto_helper.c
index dd60d0b..1fe975d 100644
--- a/target-arm/crypto_helper.c
+++ b/target-arm/crypto_helper.c
@@ -22,6 +22,14 @@ union CRYPTO_STATE {
 uint64_t   l[2];
 };
 
+#ifdef HOST_WORDS_BIGENDIAN
+#define CR_ST_BYTE(state, i)   (state.bytes[(15 - (i)) ^ 8])
+#define CR_ST_WORD(state, i)   (state.words[(3 - (i)) ^ 2])
+#else
+#define CR_ST_BYTE(state, i)   (state.bytes[i])
+#define CR_ST_WORD(state, i)   (state.words[i])
+#endif
+
 void HELPER(crypto_aese)(CPUARMState *env, uint32_t rd, uint32_t rm,
  uint32_t decrypt)
 {
@@ -46,7 +54,7 @@ void HELPER(crypto_aese)(CPUARMState *env, uint32_t rd, 
uint32_t rm,
 
 /* combine ShiftRows operation and sbox substitution */
 for (i = 0; i  16; i++) {
-st.bytes[i] = sbox[decrypt][rk.bytes[shift[decrypt][i]]];
+CR_ST_BYTE(st, i) = sbox[decrypt][CR_ST_BYTE(rk, shift[decrypt][i])];
 }
 
 env-vfp.regs[rd] = make_float64(st.l[0]);
@@ -198,11 +206,11 @@ void HELPER(crypto_aesmc)(CPUARMState *env, uint32_t rd, 
uint32_t rm,
 assert(decrypt  2);
 
 for (i = 0; i  16; i += 4) {
-st.words[i  2] = cpu_to_le32(
-mc[decrypt][st.bytes[i]] ^
-rol32(mc[decrypt][st.bytes[i + 1]], 8) ^
-rol32(mc[decrypt][st.bytes[i + 2]], 16) ^
-rol32(mc[decrypt][st.bytes[i + 3]], 24));
+CR_ST_WORD(st, i  2) =
+mc[decrypt][CR_ST_BYTE(st, i)] ^
+rol32(mc[decrypt][CR_ST_BYTE(st, i + 1)], 8) ^
+rol32(mc[decrypt][CR_ST_BYTE(st, i + 2)], 16) ^
+rol32(mc[decrypt][CR_ST_BYTE(st, i + 3)], 24);
 }
 
 env-vfp.regs[rd] = make_float64(st.l[0]);
@@ -255,24 +263,25 @@ void HELPER(crypto_sha1_3reg)(CPUARMState *env, uint32_t 
rd, uint32_t rn,
 
 switch (op) {
 case 0: /* sha1c */
-t = cho(d.words[1], d.words[2], d.words[3]);
+t = cho(CR_ST_WORD(d, 1), CR_ST_WORD(d, 2), CR_ST_WORD(d, 3));
 break;
 case 1: /* sha1p */
-t = par(d.words[1], d.words[2], d.words[3]);
+t = par(CR_ST_WORD(d, 1), CR_ST_WORD(d, 2), CR_ST_WORD(d, 3));
 break;
 case 2: /* sha1m */
-t = maj(d.words[1], d.words[2], d.words[3]);
+t = maj(CR_ST_WORD(d, 1), CR_ST_WORD(d, 2), CR_ST_WORD(d, 3));
 break;
 default:
 g_assert_not_reached();
 }
-t += rol32(d.words[0], 5) + n.words[0] + m.words[i];
-
-n.words[0] = d.words[3];
-d.words[3] = d.words[2];
-d.words[2] = ror32(d.words[1], 2);
-d.words[1] = d.words[0];
-d.words[0] = t;
+t += rol32(CR_ST_WORD(d, 0), 5) + CR_ST_WORD(n, 0)
+ + CR_ST_WORD(m, i);
+
+CR_ST_WORD(n, 0) = CR_ST_WORD(d, 3);
+CR_ST_WORD(d, 3) = CR_ST_WORD(d, 2);
+CR_ST_WORD(d, 2) = ror32(CR_ST_WORD(d, 1), 2);
+CR_ST_WORD(d, 1) = CR_ST_WORD(d, 0);
+CR_ST_WORD(d, 0) = t;
 }
 }
 env-vfp.regs[rd] = make_float64(d.l[0]);
@@ -286,8 +295,8 @@ void HELPER(crypto_sha1h)(CPUARMState *env, uint32_t rd, 
uint32_t rm)
 float64_val(env-vfp.regs[rm + 1])
 } };
 
-m.words[0] = ror32(m.words[0], 2);
-m.words[1] = m.words[2] = m.words[3] = 0;
+CR_ST_WORD(m, 0) = ror32(CR_ST_WORD(m, 0), 2);
+CR_ST_WORD(m, 1) = CR_ST_WORD(m, 2) = CR_ST_WORD(m, 3) = 0;
 
 env-vfp.regs[rd] = make_float64(m.l[0]);
 env-vfp.regs[rd + 1] = make_float64(m.l[1]);
@@ -304,10 +313,10 @@ void HELPER(crypto_sha1su1)(CPUARMState *env, uint32_t 
rd, uint32_t rm)
 float64_val(env-vfp.regs[rm + 1])
 } };
 
-d.words[0] = rol32(d.words[0] ^ m.words[1], 1);
-d.words[1] = rol32(d.words[1] ^ m.words[2], 1);
-d.words[2] = rol32(d.words[2] ^ m.words[3], 1);
-d.words[3] = rol32(d.words[3] ^ d.words[0],

[Qemu-devel] [PULL 0/2] target-arm queue

2015-01-16 Thread Peter Maydell

A short queue, but I don't want to sit on these fixes any longer.

-- PMM


The following changes since commit e68cba36360a2ab5bf0576b66df4d0eb0d822f8d:

  Merge remote-tracking branch 'remotes/amit-migration/tags/mig-2.3-1' into 
staging (2015-01-16 10:16:14 +)

are available in the git repository at:


  git://git.linaro.org/people/pmaydell/qemu-arm.git 
tags/pull-target-arm-20150116

for you to fetch changes up to 36b62ae6a58f9a588fd33be9386e18a2b90103f5:

  fw_cfg: fix endianness in fw_cfg_data_mem_read() / _write() (2015-01-16 
11:54:30 +)


target-arm queue:
 * fix endianness handling in fwcfg wide registers
 * fix broken crypto insn emulation on big endian hosts


Ard Biesheuvel (1):
  target-arm: crypto: fix BE host support

Laszlo Ersek (1):
  fw_cfg: fix endianness in fw_cfg_data_mem_read() / _write()

 hw/nvram/fw_cfg.c  |  41 +++-
 target-arm/crypto_helper.c | 114 +
 2 files changed, 70 insertions(+), 85 deletions(-)

[Qemu-devel] [PULL 2/2] fw_cfg: fix endianness in fw_cfg_data_mem_read() / _write()

2015-01-16 Thread Peter Maydell

From: Laszlo Ersek ler...@redhat.com

(1) Let's contemplate what device endianness means, for a memory mapped
device register (independently of QEMU -- that is, on physical hardware).

It determines the byte order that the device will put on the data bus when
the device is producing a *numerical value* for the CPU. This byte order
may differ from the CPU's own byte order, therefore when software wants to
consume the *numerical value*, it may have to swap the byte order first.

For example, suppose we have a device that exposes in a 2-byte register
the number of sheep we have to count before falling asleep. If the value
is decimal 37 (0x0025), then a big endian register will produce [0x00,
0x25], while a little endian register will produce [0x25, 0x00].

If the device register is big endian, but the CPU is little endian, the
numerical value will read as 0x2500 (decimal 9472), which software has to
byte swap before use.

However... if we ask the device about who stole our herd of sheep, and it
answers XY, then the byte representation coming out of the register must
be [0x58, 0x59], regardless of the device register's endianness for
numeric values. And, software needs to copy these bytes into a string
field regardless of the CPU's own endianness.

(2) QEMU's device register accessor functions work with *numerical values*
exclusively, not strings:

The emulated register's read accessor function returns the numerical value
(eg. 37 decimal, 0x0025) as a *host-encoded* uint64_t. QEMU translates
this value for the guest to the endianness of the emulated device register
(which is recorded in MemoryRegionOps.endianness). Then guest code must
translate the numerical value from device register to guest CPU
endianness, before including it in any computation (see (1)).

(3) However, the data register of the fw_cfg device shall transfer strings
*only* -- that is, opaque blobs. Interpretation of any given blob is
subject to further agreement -- it can be an integer in an independently
determined byte order, or a genuine string, or an array of structs of
integers (in some byte order) and fixed size strings, and so on.

Because register emulation in QEMU is integer-preserving, not
string-preserving (see (2)), we have to jump through a few hoops.

(3a) We defined the memory mapped fw_cfg data register as
DEVICE_BIG_ENDIAN.

The particular choice is not really relevant -- we picked BE only for
consistency with the control register, which *does* transfer integers --
but our choice affects how we must host-encode values from fw_cfg strings.

(3b) Since we want the fw_cfg string XY to appear as the [0x58, 0x59]
array on the data register, *and* we picked DEVICE_BIG_ENDIAN, we must
compose the host (== C language) value 0x5859 in the read accessor
function.

(3c) When the guest performs the read access, the immediate uint16_t value
will be 0x5958 (in LE guests) and 0x5859 (in BE guests). However, the
uint16_t value does not matter. The only thing that matters is the byte
pattern [0x58, 0x59], which the guest code must copy into the target
string *without* any byte-swapping.

(4) Now I get to explain where I screwed up. :(

When we decided for big endian *integer* representation in the MMIO data
register -- see (3a) --, I mindlessly added an indiscriminate
byte-swizzling step to the (little endian) guest firmware.

This was a grave error -- it violates (3c) --, but I didn't realize it. I
only saw that the code I otherwise intended for fw_cfg_data_mem_read():

value = 0;
for (i = 0; i  size; ++i) {
value = (value  8) | fw_cfg_read(s);
}

didn't produce the expected result in the guest.

In true facepalm style, instead of blaming my guest code (which violated
(3c)), I blamed my host code (which was correct). Ultimately, I coded
ldX_he_p() into fw_cfg_data_mem_read(), because that happened to work.

Obviously (...in retrospect) that was wrong. Only because my host happened
to be LE, ldX_he_p() composed the (otherwise incorrect) host value 0x5958
from the fw_cfg string XY. And that happened to compensate for the bogus
indiscriminate byte-swizzling in my guest code.

Clearly the current code leaks the host endianness through to the guest,
which is wrong. Any device should work the same regardless of host
endianness.

The solution is to compose the host-endian representation (2) of the big
endian interpretation (3a, 3b) of the fw_cfg string, and to drop the wrong
byte-swizzling in the guest (3c).

Brown paper bag time for me.

Signed-off-by: Laszlo Ersek ler...@redhat.com
Message-id: 1420024880-15416-1-git-send-email-ler...@redhat.com
Reviewed-by: Peter Maydell peter.mayd...@linaro.org
Signed-off-by: Peter Maydell peter.mayd...@linaro.org
---
 hw/nvram/fw_cfg.c | 41 +++--
 1 file changed, 7 insertions(+), 34 deletions(-)

diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index fcdf821..78a37be 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -287,51 +287,24 @@ static

Re: [Qemu-devel] Multi GPU passthrough via VFIO

2015-01-16 Thread Maik Broemme

Hi Alex,

Maik Broemme mbroe...@parallels.com wrote:
 Hi Alex,
 
 Maik Broemme mbroe...@parallels.com wrote:
  Hi Alex,
  
  Alex Williamson alex.william...@redhat.com wrote:
   On Fri, 2014-02-14 at 01:01 +0100, Maik Broemme wrote:
Hi Alex,

Maik Broemme mbroe...@parallels.com wrote:
 Hi Alex,
 
 Alex Williamson alex.william...@redhat.com wrote:
  On Fri, 2014-02-07 at 01:22 +0100, Maik Broemme wrote:
   Interesting is the diff between 1st and 2nd boot, so if I do the 
   lspci
   prior to the booting. The only difference between 1st start and 
   2nd
   start are:
   
   --- 001-lspci.290x.before.1st.log 2014-02-07 01:13:41.498827928 
   +0100
   +++ 004-lspci.290x.before.2nd.log 2014-02-07 01:16:50.966611282 
   +0100
   @@ -24,7 +24,7 @@
 ClockPM- Surprise- LLActRep- BwNot-
 LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- 
   CommClk+
 ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
   - LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ 
   DLActive- BWMgmt- ABWMgmt-
   + LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
   DLActive- BWMgmt- ABWMgmt-
 DevCap2: Completion Timeout: Not Supported, 
   TimeoutDis-, LTR-, OBFF Not Supported
 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, 
   LTR-, OBFF Disabled
 LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- 
   SpeedDis-
   @@ -33,13 +33,13 @@
 LnkSta2: Current De-emphasis Level: -3.5dB, 
   EqualizationComplete-, EqualizationPhase1-
  EqualizationPhase2-, EqualizationPhase3-, 
   LinkEqualizationRequest-
 Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
   - Address:   Data: 
   + Address: fee0  Data: 
 Capabilities: [100 v1] Vendor Specific Information: ID=0001 
   Rev=1 Len=010 ?
 Capabilities: [150 v2] Advanced Error Reporting
 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
   UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
   UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- 
   UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
   - CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
   NonFatalErr-
   + CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
   NonFatalErr+
 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
   NonFatalErr+
 AERCap: First Error Pointer: 00, GenCap+ CGenEn- 
   ChkCap+ ChkEn-
 Capabilities: [270 v1] #19
   
   After that if I do suspend-to-ram / resume trick I have again 
   lspci
   output from before 1st boot.
  
  The Link Status change after X is stopped seems the most 
  interesting to
  me.  The MSI change is probably explained by the MSI save/restore 
  of the
  device, but should be harmless since MSI is disabled.  I'm a bit
  surprised the Correctable Error Status in the AER capability didn't 
  get
  cleared.  I would have thought that a bus reset would have caused 
  the
  link to retrain back to the original speed/width as well.  Let's 
  check
  that we're actually getting a bus reset, try this in addition to the
  previous qemu patch.  This just enables debug logging for the bus 
  resest
  function.  Thanks,
  
 
 Below are the outputs from 2 boots, VGA, load fglrx and start X. (2nd
 time X gets killed and oops happened)
 
 - 1st boot:
 
 vfio: vfio_pci_hot_reset(:01:00.1) multi
 vfio: :01:00.1: hot reset dependent devices:
 vfio: :01:00.0 group 1
 vfio: :01:00.1 group 1
 vfio: :01:00.1 hot reset: Success
 vfio: vfio_pci_hot_reset(:01:00.1) one
 vfio: :01:00.1: hot reset dependent devices:
 vfio: :01:00.0 group 1
 vfio: vfio: found another in-use device :01:00.0
 vfio: vfio_pci_hot_reset(:01:00.0) one
 vfio: :01:00.0: hot reset dependent devices:
 vfio: :01:00.0 group 1
 vfio: :01:00.1 group 1
 vfio: vfio: found another in-use device :01:00.1
 
 - 2nd boot:
 
 vfio: vfio_pci_hot_reset(:01:00.1) multi
 vfio: :01:00.1: hot reset dependent devices:
 vfio: :01:00.0 group 1
 vfio: :01:00.1 group 1
 vfio: :01:00.1 hot reset: Success
 vfio: vfio_pci_hot_reset(:01:00.1) one
 vfio: :01:00.1: hot reset dependent devices:
 vfio: :01:00.0 group 1
 vfio: vfio: found another in-use device :01:00.0
 vfio:

Re: [Qemu-devel] [PULL 0/2] target-arm queue

2015-01-16 Thread Peter Maydell

On 16 January 2015 at 12:05, Peter Maydell peter.mayd...@linaro.org wrote:
 A short queue, but I don't want to sit on these fixes any longer.

 -- PMM


 The following changes since commit e68cba36360a2ab5bf0576b66df4d0eb0d822f8d:

   Merge remote-tracking branch 'remotes/amit-migration/tags/mig-2.3-1' into 
 staging (2015-01-16 10:16:14 +)

 are available in the git repository at:


   git://git.linaro.org/people/pmaydell/qemu-arm.git 
 tags/pull-target-arm-20150116

 for you to fetch changes up to 36b62ae6a58f9a588fd33be9386e18a2b90103f5:

   fw_cfg: fix endianness in fw_cfg_data_mem_read() / _write() (2015-01-16 
 11:54:30 +)

 
 target-arm queue:
  * fix endianness handling in fwcfg wide registers
  * fix broken crypto insn emulation on big endian hosts

Applied, thanks.

-- PMM

Re: [Qemu-devel] [PATCH 7/9] block-migration: remove not needed iothread lock

2015-01-16 Thread Vladimir Sementsov-Ogievskiy


On 09.01.2015 00:24, John Snow wrote:



On 12/11/2014 09:17 AM, Vladimir Sementsov-Ogievskiy wrote:

Instead of locking iothread, we can just swap these calls. So, if some
write to our range occures before resetting the bitmap, then it will
get into subsequent aio read, becouse it occures, in any case, after
resetting the bitmap.



s/occures/occurs/g
s/becouse/because/g

(I hope you are not annoyed by the spelling corrections: They are in 
good faith and personally I would hope people would correct any of my 
spelling mistakes before it goes in the git log!)



Signed-off-by: Vladimir Sementsov-Ogievskiy vsement...@parallels.com
---
  block-migration.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/block-migration.c b/block-migration.c
index d0c825f..908a66d 100644
--- a/block-migration.c
+++ b/block-migration.c
@@ -315,13 +315,11 @@ static int mig_save_device_bulk(QEMUFile *f, 
BlkMigDevState *bmds)

  block_mig_state.submitted++;
  blk_mig_unlock();

-qemu_mutex_lock_iothread();
+bdrv_reset_dirty_bitmap(bs, bmds-dirty_bitmap, cur_sector, 
nr_sectors);

+
  blk-aiocb = bdrv_aio_readv(bs, cur_sector, blk-qiov,
  nr_sectors, blk_mig_read_cb, blk);

-bdrv_reset_dirty_bitmap(bs, bmds-dirty_bitmap, cur_sector, 
nr_sectors);

-qemu_mutex_unlock_iothread();
-
  bmds-cur_sector = cur_sector + nr_sectors;
  return (bmds-cur_sector = total_sectors);
  }



OK, so the justification here is that by ordering it as reset, read 
that any writes that occur after the reset will be simply included in 
the read, and we won't have reset any bits that we didn't actually get 
a chance to read.


OK.

But what about losing the ability to clear bits that are set needlessly?

e.g.:

Sector 1 is dirty. Sector 2 is clean.
We reset the bitmap. All sectors are clean.
A write occurs to sector 2, marking it dirty.
We read sectors one and two.
Sector two is now erroneously marked dirty, when it isn't.

It's not a data integrity problem, but we're swapping one inefficiency 
for another.


Do you have a justification for why this is better than the lock?


I think without lock is always better. Lock-less operations are usually 
better, aren't they? So, I didn't think about such kind of efficiency, 
but only about that this lock is needless. If we don't swap these reset 
and read, lock is necessary of course.


Furthermore, If we have a lock, this write (to sector 2) will happen 
after unlock, and the sector will be dirty anyway. So, this efficiency 
is the same, but we can remove lock.

Re: [Qemu-devel] [PATCH 7/9] block-migration: remove not needed iothread lock

2015-01-16 Thread Vladimir Sementsov-Ogievskiy



Best regards,
Vladimir

On 09.01.2015 01:28, Paolo Bonzini wrote:


On 11/12/2014 15:17, Vladimir Sementsov-Ogievskiy wrote:

-qemu_mutex_lock_iothread();
+bdrv_reset_dirty_bitmap(bs, bmds-dirty_bitmap, cur_sector, nr_sectors);
+
  blk-aiocb = bdrv_aio_readv(bs, cur_sector, blk-qiov,
  nr_sectors, blk_mig_read_cb, blk);
  
-bdrv_reset_dirty_bitmap(bs, bmds-dirty_bitmap, cur_sector, nr_sectors);

-qemu_mutex_unlock_iothread();
-

bdrv_aio_readv is not thread safe, so it needs the iothread lock.

bdrv_reset_dirty_bitmap is also not thread safe, because
bmds-dirty_bitmap is accessed concurrently by bdrv_reset_dirty
(discard) and bdrv_set_dirty (write).  So it needs the iothread lock too.

Moving bdrv_reset_dirty_bitmap before the read is a good idea.

Paolo

Ok.

Re: [Qemu-devel] [PATCH 0/8] RCUification of the memory API, part 1

2015-01-16 Thread Christian Borntraeger

Am 13.01.2015 um 18:52 schrieb Paolo Bonzini:
 These are the minimal changes to adopt RCU and use it in memory_region_find
 (and hence in virtio-blk-dataplane).  Looks big, but two thirds of it
 is documentation and tests.
 
 Please review! :)

Really Nice. 
This gives me a boost from 1000MB/sec to 1400MB/sec on my fio test.
(4k block random reads on null block devices).
perf shows me that pthread_mutex_lock and kernel spin locking contention
(I have a 32cpu host) goes away.

Christian
 
 Paolo
 
 Jan Kiszka (1):
   memory: remove assertion on memory_region_destroy
 
 Paolo Bonzini (7):
   tls: require compiler support for __thread
   rcu: add rcu library
   rcu: add rcutorture
   rcu: allow nesting of rcu_read_lock/rcu_read_unlock
   rcu: add call_rcu
   memory: protect current_map by RCU
   memory: avoid ref/unref in memory_region_find
 
  configure |   9 +-
  docs/rcu.txt  | 388 +++
  exec.c|   2 +-
  hw/9pfs/virtio-9p-synth.c |   1 +
  include/exec/memory.h |   5 +
  include/qemu/atomic.h |  61 +++
  include/qemu/queue.h  |  13 ++
  include/qemu/rcu.h| 155 
  include/qemu/thread.h |   3 -
  include/qemu/tls.h|  52 --
  include/qom/cpu.h |   4 +-
  memory.c  |  60 +++
  tests/Makefile|  11 +-
  tests/rcutorture.c| 449 
 ++
  tests/test-tls.c  |  83 +
  util/Makefile.objs|   1 +
  util/rcu.c| 290 ++
  17 files changed, 1485 insertions(+), 102 deletions(-)
  create mode 100644 docs/rcu.txt
  create mode 100644 include/qemu/rcu.h
  delete mode 100644 include/qemu/tls.h
  create mode 100644 tests/rcutorture.c
  create mode 100644 tests/test-tls.c
  create mode 100644 util/rcu.c

Re: [Qemu-devel] [PATCH v2] qemu-iotests: Fix supported_oses check

2015-01-16 Thread Stefan Hajnoczi

On Fri, Jan 16, 2015 at 09:38:42AM +0800, Fam Zheng wrote:
 There is a bug in the recently added sys.platform test, and we no longer
 run python tests, because linux2 is the value to compare here. So do a
 prefix match. According to python doc [1], the way to use sys.platform
 is unless you want to test for a specific system version, it is
 therefore recommended to use the following idiom:
 
 if sys.platform.startswith('freebsd'):
 # FreeBSD-specific code here...
 elif sys.platform.startswith('linux'):
 # Linux-specific code here...
 
 [1]: https://docs.python.org/2.7/library/sys.html#sys.platform
 
 Signed-off-by: Fam Zheng f...@redhat.com
 
 ---
 v2: Don't use any().
 Explain why prefix match is fine.
 (Thanks, Stefan)
 ---
  tests/qemu-iotests/iotests.py | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan


pgpMWYxjO6qs6.pgp
Description: PGP signature

Re: [Qemu-devel] [PATCH v3 0/2] buildsys: Fix and enable module build

2015-01-16 Thread Stefan Hajnoczi

On Fri, Jan 16, 2015 at 09:54:05AM +0800, Fam Zheng wrote:
 v3: Fix commit message, error message and function name. (Stefan)
 
 Fam Zheng (2):
   configure: Default to enable module build
   .travis.yml: Add --disable-modules
 
  .travis.yml |  3 ++
  configure   | 95 
 ++---
  2 files changed, 69 insertions(+), 29 deletions(-)
 
 -- 
 1.9.3
 

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com


pgpdChbdwfbPl.pgp
Description: PGP signature

Re: [Qemu-devel] [PATCH] qtest: Fix deadloop by running main loop AIO context's timers

2015-01-16 Thread Stefan Hajnoczi

On Fri, Jan 16, 2015 at 04:01:17PM +0800, Fam Zheng wrote:
 qemu_clock_run_timers() only takes care of main_loop_tlg, we shouldn't
 forget aio timer list groups.
 
 Currently, the qemu_clock_deadline_ns_all (a few lines above) counts all
 the timergroups of this clock type, including aio tlg, but we don't fire
 them, so they are never cleared, which makes a dead loop.
 
 For example, this function hangs when trying to drive throttled block
 request queue with qtest clock_step.
 
 Signed-off-by: Fam Zheng f...@redhat.com
 ---
  cpus.c | 1 +
  1 file changed, 1 insertion(+)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com


pgp4EqiLM0cJe.pgp
Description: PGP signature

Re: [Qemu-devel] [PATCH] Add 'setup' phase to docs for query-migrate QMP command

2015-01-16 Thread Eric Blake

On 12/12/2014 07:14 AM, Alex Bligh wrote:
 The QMP command 'query-migrate' returns the state 'setup' during
 the setup phase. This patch documents it.
 
 Signed-off-by: Alex Bligh a...@alex.org.uk
 ---
  qmp-commands.hx | 11 ---
  1 file changed, 8 insertions(+), 3 deletions(-)
 

Reviewed-by: Eric Blake ebl...@redhat.com

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH 1/2] tcg-aarch64: user doesn't need R/W access to exec

2015-01-16 Thread Peter Maydell

On 13 January 2015 at 15:48, Andrew Jones drjo...@redhat.com wrote:
 Table D4-32 shows that execute access from EL0 doesn't depend
 on AP[1].

This commit message is a bit sparse, which confused me
for a bit. It would be worth beefing it up a bit:

target-arm: 64-bit EL0 code can execute from unreadable pages

In AArch64 mode, a page can be executable even if it is not
readable (a difference from AArch32). Instead of bailing out
early if the page is not readable, just add 32 bit and
page not readable to the list of conditions that make a
page non-executable, and check whether the protections and
the access type are compatible once at the end of the function.

 Signed-off-by: Andrew Jones drjo...@redhat.com
 ---
  target-arm/helper.c | 27 ---
  1 file changed, 16 insertions(+), 11 deletions(-)

 diff --git a/target-arm/helper.c b/target-arm/helper.c
 index 3ef0f1f38eda5..7c30a2669a0f2 100644
 --- a/target-arm/helper.c
 +++ b/target-arm/helper.c
 @@ -4787,7 +4787,7 @@ static int get_phys_addr_lpae(CPUARMState *env, 
 target_ulong address,
  hwaddr descaddr, descmask;
  uint32_t tableattrs;
  target_ulong page_size;
 -uint32_t attrs;
 +uint32_t attrs, ap;
  int32_t granule_sz = 9;
  int32_t va_size = 32;
  int32_t tbi = 0;
 @@ -4952,14 +4952,20 @@ static int get_phys_addr_lpae(CPUARMState *env, 
 target_ulong address,
  /* Access flag */
  goto do_fault;
  }
 +
  fault_type = permission_fault;
 -if (is_user  !(attrs  (1  4))) {
 -/* Unprivileged access not enabled */
 -goto do_fault;
 +ap = extract32(attrs, 4, 2); /* AP[2:1] */
 +
 +*prot = 0;
 +if (!is_user || (ap  1)) {
 +*prot |= PAGE_READ;
 +*prot |= !(ap  2) ? PAGE_WRITE : 0;

Personally I would find
 if (!(ap  2)) {
 *prot |= PAGE_WRITE;
 }

clearer.

  }
 -*prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
 +
 +*prot |= PAGE_EXEC;
  if ((arm_feature(env, ARM_FEATURE_V8)  is_user  (attrs  (1  12))) 
 ||
  (!arm_feature(env, ARM_FEATURE_V8)  (attrs  (1  12))) ||
 +(!arm_el_is_aa64(env, 1)  is_user  !(ap  1)) ||
  (!is_user  (attrs  (1  11 {
  /* XN/UXN or PXN. Since we only implement EL0/EL1 we unconditionally
   * treat XN/UXN as UXN for v8.
 @@ -4969,12 +4975,11 @@ static int get_phys_addr_lpae(CPUARMState *env, 
 target_ulong address,
  }

There is a if access_type == 2 goto do_fault check just
above this hunk which you can delete, because we're now
doing that check in the code you add below.

  *prot = ~PAGE_EXEC;
  }
 -if (attrs  (1  5)) {
 -/* Write access forbidden */
 -if (access_type == 1) {
 -goto do_fault;
 -}
 -*prot = ~PAGE_WRITE;
 +
 +if ((*prot == 0)
 +|| (!(*prot  PAGE_WRITE)  access_type == 1)
 +|| (!(*prot  PAGE_EXEC)  access_type == 2)) {
 +goto do_fault;

Why isn't this just
if (!(*prot  (1  access_type))) {

? (Or at least, why doesn't it treat PAGE_READ the same way
as the other two bits?) As it is I think we'll treat a page
that is marked exec-not-readable as if it were readable.

thanks
-- PMM

Re: [Qemu-devel] [PATCH 2/4] console: add opengl rendering helper functions

2015-01-16 Thread Max Reitz


On 2015-01-16 at 04:21, Gerd Hoffmann wrote:

   Hi,


+bool console_gl_check_format(DisplayChangeListener *dcl,
+ pixman_format_code_t format)
+{
+switch (format) {
+case PIXMAN_x8r8g8b8:
+case PIXMAN_a8r8g8b8:
+case PIXMAN_r5g6b5:
+return true;
+default:
+return false;
+}
+}

What is this function supposed to be used for?

Inflight change, there is a patch series on the list adding a
check_format callback to DisplayChangeListenerOps, for format
negotiation.


OK.


+
+void surface_gl_create_texture(DisplaySurface *surface)
+{
+switch (surface-format) {
+case PIXMAN_x8r8g8b8:
+case PIXMAN_a8r8g8b8:
+surface-glformat = GL_BGRA;

Why does the format code seem to imply ARGB order but you're using the
exact opposite? I could imagine something like endianness being the
reason, but you're using RGB below where the format code says exactly that.

endianness indeed.  pixman formats are native endian, so this actually
is bgra ordering in memory.  While looking at it:  I guess GL_BGRA is
fixed byte ordering?


Yes, I think so.


So this probably needs fixing to work correctly on
bigendian machines ...


Probably, right.


Discarding the alpha channel is intentional, I suppose?

Yes.


+ surface_width(surface),
+ surface_height(surface),
+ 0, surface-glformat, surface-gltype,
+ surface_data(surface));

Is surface_stride(surface) specified to be surface_width(surface) *
bytes_per_pixel (I don't know pixman so I don't know)?

Usually this is the case, but there can be exceptions.  Hmm, can I
explicitly pass the stride?


The internet™ tells me about glPixelStorei(GL_UNPACK_ROW_LENGTH, 
$pixels_per_row) and glPixelStorei(GL_UNPACK_ALIGNMENT, 
$row_alignment_in_bytes) (which need to be called before using 
glTexImage2D() or glTexSubImage2D()). You may want to use it for 
surface_gl_update_texture(), too (if that isn't too much work).


Max

Re: [Qemu-devel] [PATCH v11 03/13] qmp: Add block-dirty-bitmap-add and block-dirty-bitmap-remove

2015-01-16 Thread Max Reitz


On 2015-01-12 at 11:30, John Snow wrote:

From: Fam Zheng f...@redhat.com

The new command pair is added to manage user created dirty bitmap. The
dirty bitmap's name is mandatory and must be unique for the same device,
but different devices can have bitmaps with the same names.

The granularity is an optional field. If it is not specified, we will
choose a default granularity based on the cluster size if available,
clamped to between 4K and 64K to mirror how the 'mirror' code was
already choosing granularity. If we do not have cluster size info
available, we choose 64K. This code has been factored out into a helper
shared with block/mirror.

This patch also introduces the 'block_dirty_bitmap_lookup' helper,
which takes a device name and a dirty bitmap name and validates the
lookup, returning NULL and setting errp if there is a problem with
either field. This helper will be re-used in future patches in this
series.

The types added to block-core.json will be re-used in future patches
in this series, see:
'qapi: Add transaction support to block-dirty-bitmap-{add, enable, disable}'

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   |  20 ++
  block/mirror.c|  10 +
  blockdev.c| 100 ++
  include/block/block.h |   1 +
  qapi/block-core.json  |  55 +++
  qmp-commands.hx   |  51 +
  6 files changed, 228 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index bfeae6b..3eb77ee 100644
--- a/block.c
+++ b/block.c
@@ -5417,6 +5417,26 @@ int bdrv_get_dirty(BlockDriverState *bs, BdrvDirtyBitmap 
*bitmap, int64_t sector
  }
  }
  
+/**

+ * Chooses a default granularity based on the existing cluster size,
+ * but clamped between [4K, 64K]. Defaults to 64K in the case that there
+ * is no cluster size information available.
+ */
+uint64_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs)
+{
+BlockDriverInfo bdi;
+uint64_t granularity;
+
+if (bdrv_get_info(bs, bdi) = 0  bdi.cluster_size != 0) {
+granularity = MAX(4096, bdi.cluster_size);
+granularity = MIN(65536, granularity);
+} else {
+granularity = 65536;
+}
+
+return granularity;
+}
+
  void bdrv_dirty_iter_init(BlockDriverState *bs,
BdrvDirtyBitmap *bitmap, HBitmapIter *hbi)
  {
diff --git a/block/mirror.c b/block/mirror.c
index d819952..fc545f1 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -667,15 +667,7 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
  MirrorBlockJob *s;
  
  if (granularity == 0) {

-/* Choose the default granularity based on the target file's cluster
- * size, clamped between 4k and 64k.  */
-BlockDriverInfo bdi;
-if (bdrv_get_info(target, bdi) = 0  bdi.cluster_size != 0) {
-granularity = MAX(4096, bdi.cluster_size);
-granularity = MIN(65536, granularity);
-} else {
-granularity = 65536;
-}
+granularity = bdrv_get_default_bitmap_granularity(target);
  }
  
  assert ((granularity  (granularity - 1)) == 0);

diff --git a/blockdev.c b/blockdev.c
index 5651a8e..95251c7 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1173,6 +1173,48 @@ out_aio_context:
  return NULL;
  }
  
+/**

+ * Return a dirty bitmap (if present), after validating
+ * the node reference and bitmap names. Returns NULL on error,
+ * including when the BDS and/or bitmap is not found.
+ */
+static BdrvDirtyBitmap *block_dirty_bitmap_lookup(const char *node_ref,
+  const char *name,
+  BlockDriverState **pbs,
+  Error **errp)
+{
+BlockDriverState *bs;
+BdrvDirtyBitmap *bitmap;
+
+if (!node_ref) {
+error_setg(errp, Node reference cannot be NULL);
+return NULL;
+}
+if (!name) {
+error_setg(errp, Bitmap name cannot be NULL);
+return NULL;
+}
+
+bs = bdrv_lookup_bs(node_ref, node_ref, errp);
+if (!bs) {
+error_setg(errp, Node reference '%s' not found, node_ref);


No need to throw the (hopefully) perfectly fine Error code returned by 
bdrv_lookup_bs() away.



+return NULL;
+}
+
+/* If caller provided a BDS*, provide the result of that lookup, too. */
+if (pbs) {
+*pbs = bs;
+}
+
+bitmap = bdrv_find_dirty_bitmap(bs, name);
+if (!bitmap) {
+error_setg(errp, Dirty bitmap not found: %s, name);


I'd use Dirty bitmap '%s' not found, because foo: bar in an error 
message looks like foo because bar to me. But it's up to you, dirty 
bitmap names most likely don't look like error reasons so nobody should 
be able to confuse it.


So with the error_setg() in the fail path for bdrv_lookup_bs() removed:

[Qemu-devel] [PULL 09/16] block/dmg: fix sector data offset calculation

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

This patch addresses two issues:

 - The data fork offset was not taken into account, resulting in failure
   to read an InstallESD.dmg file (5164763151 bytes) which had a
   non-zero DataForkOffset field.
 - The offset of the previous block (partition) was unconditionally
   added to the current block because older files would start the input
   offset of a new block at zero. Newer files (including vlc-2.1.5.dmg,
   tuxpaint-0.9.15-macosx.dmg and OS X Yosemite [MAS].dmg) failed in
   reads because these files have chunk offsets, relative to the begin
   of a data fork.

Now the data offset of the mish is taken into account. While we could
check that the data_offset is within the data fork, let's not do that
here as it would only result in parse failures on invalid files (rather
than gracefully handling such bad files). dmg_read will error out if
the offset is incorrect.

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Message-id: 1420566495-13284-9-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/block/dmg.c b/block/dmg.c
index 3f8f1cc..65fc7fe 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -186,7 +186,7 @@ static int64_t dmg_find_koly_offset(BlockDriverState 
*file_bs, Error **errp)
 typedef struct DmgHeaderState {
 /* used internally by dmg_read_mish_block to remember offsets of blocks
  * across calls */
-uint64_t last_in_offset;
+uint64_t data_fork_offset;
 uint64_t last_out_offset;
 /* exported for dmg_open */
 uint32_t max_compressed_size;
@@ -201,6 +201,8 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 size_t new_size;
 uint32_t chunk_count;
 int64_t offset = 0;
+uint64_t data_offset;
+uint64_t in_offset = ds-data_fork_offset;
 
 type = buff_read_uint32(buffer, offset);
 /* skip data that is not a valid MISH block (invalid magic or too small) */
@@ -209,8 +211,12 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 return 0;
 }
 
-offset += 4;
-offset += 200;
+/* location in data fork for (compressed) blob (in bytes) */
+data_offset = buff_read_uint64(buffer, offset + 0x18);
+in_offset += data_offset;
+
+/* move to begin of chunk entries */
+offset += 204;
 
 chunk_count = (count - 204) / 40;
 new_size = sizeof(uint64_t) * (s-n_chunks + chunk_count);
@@ -226,7 +232,6 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 if (s-types[i] != 0x8005  s-types[i] != 1 
 s-types[i] != 2) {
 if (s-types[i] == 0x  i  0) {
-ds-last_in_offset = s-offsets[i - 1] + s-lengths[i - 1];
 ds-last_out_offset = s-sectors[i - 1] +
   s-sectorcounts[i - 1];
 }
@@ -253,7 +258,7 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 }
 
 s-offsets[i] = buff_read_uint64(buffer, offset);
-s-offsets[i] += ds-last_in_offset;
+s-offsets[i] += in_offset;
 offset += 8;
 
 s-lengths[i] = buff_read_uint64(buffer, offset);
@@ -411,7 +416,7 @@ static int dmg_open(BlockDriverState *bs, QDict *options, 
int flags,
 s-n_chunks = 0;
 s-offsets = s-lengths = s-sectors = s-sectorcounts = NULL;
 /* used by dmg_read_mish_block to keep track of the current I/O position */
-ds.last_in_offset = 0;
+ds.data_fork_offset = 0;
 ds.last_out_offset = 0;
 ds.max_compressed_size = 1;
 ds.max_sectors_per_chunk = 1;
@@ -423,6 +428,15 @@ static int dmg_open(BlockDriverState *bs, QDict *options, 
int flags,
 goto fail;
 }
 
+/* offset of data fork (DataForkOffset) */
+ret = read_uint64(bs, offset + 0x18, ds.data_fork_offset);
+if (ret  0) {
+goto fail;
+} else if (ds.data_fork_offset  offset) {
+ret = -EINVAL;
+goto fail;
+}
+
 /* offset of resource fork (RsrcForkOffset) */
 ret = read_uint64(bs, offset + 0x28, rsrc_fork_offset);
 if (ret  0) {
-- 
2.1.0

Re: [Qemu-devel] [v3 4/5] Qemu-Xen-vTPM: Qemu vTPM xenstubdoms backen.

2015-01-16 Thread Xu, Quan



 -Original Message-
 From: Stefan Berger [mailto:stef...@linux.vnet.ibm.com]
 Sent: Thursday, January 15, 2015 11:49 PM
 To: Xu, Quan; qemu-devel@nongnu.org
 Cc: stefano.stabell...@eu.citrix.com; xen-de...@lists.xen.org
 Subject: Re: [Qemu-devel] [v3 4/5] Qemu-Xen-vTPM: Qemu vTPM xenstubdoms
 backen.
 
 On 12/30/2014 06:03 PM, Quan Xu wrote:
  This Patch provides the glue for the TPM_TIS(Qemu frontend) to Xen
  stubdom vTPM domain that provides the actual TPM functionality. It
  sends data and TPM commends with xen_vtpm_frontend. It is similar as
  another two vTPM backens:
 *vTPM passthrough backen Since QEMU 1.5.
 *vTPM libtpms-based backen.
 
  Some details:
  This part of the patch provides support for the spawning of a thread
  that will interact with stubdom vTPM domain by the xen_vtpm_frontend.
  It expects a signal from the frontend to wake and pick up the TPM
  command that is supposed to be processed and delivers the response
  packet using a callback function provided by the frontend.
 
  The backend connects itself to the frontend by filling out an
  interface structure with pointers to the function implementing support
  for various operations.
 
  (QEMU) vTPM XenStubdoms backen is initialized by Qemu command line
 options,
 -tpmdev xenstubdoms,id=xenvtpm0 -device
 tpm-tis,tpmdev=xenvtpm0
 
  --Changes in v3:
  -Call vtpm_send() and vtpm_recv() directly.
 
  Signed-off-by: Quan Xu quan...@intel.com
  ---
hw/tpm/Makefile.objs |   2 +-
hw/tpm/tpm_xenstubdoms.c | 245
 +++
2 files changed, 246 insertions(+), 1 deletion(-)
create mode 100644 hw/tpm/tpm_xenstubdoms.c
 
  diff --git a/hw/tpm/Makefile.objs b/hw/tpm/Makefile.objs index
  57919fa..190e776 100644
  --- a/hw/tpm/Makefile.objs
  +++ b/hw/tpm/Makefile.objs
  @@ -1,3 +1,3 @@
common-obj-$(CONFIG_TPM_TIS) += tpm_tis.o
common-obj-$(CONFIG_TPM_PASSTHROUGH) += tpm_passthrough.o
  -common-obj-$(CONFIG_TPM_XENSTUBDOMS) += xen_vtpm_frontend.o
  +common-obj-$(CONFIG_TPM_XENSTUBDOMS) += tpm_xenstubdoms.o
  +xen_vtpm_frontend.o
  diff --git a/hw/tpm/tpm_xenstubdoms.c b/hw/tpm/tpm_xenstubdoms.c new
  file mode 100644 index 000..98ea496
  --- /dev/null
  +++ b/hw/tpm/tpm_xenstubdoms.c
  @@ -0,0 +1,245 @@
  +/*
  + * Xen Stubdom vTPM driver
  + *
  + *  Copyright (c) 2014 Intel Corporation
  + *  Authors:
  + *Quan Xu quan...@intel.com
  + *
  + * This library is free software; you can redistribute it and/or
  + * modify it under the terms of the GNU Lesser General Public
  + * License as published by the Free Software Foundation; either
  + * version 2 of the License, or (at your option) any later version.
  + *
  + * This library is distributed in the hope that it will be useful,
  + * but WITHOUT ANY WARRANTY; without even the implied warranty of
  + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  + * Lesser General Public License for more details.
  + *
  + * You should have received a copy of the GNU Lesser General Public
  + * License along with this library; if not, see
  +http://www.gnu.org/licenses/  */
  +
  +#include dirent.h
  +#include qemu-common.h
  +#include qapi/error.h
  +#include qemu/sockets.h
  +#include qemu/log.h
  +#include sysemu/tpm_backend.h
  +#include tpm_int.h
  +#include hw/hw.h
  +#include hw/i386/pc.h
  +#include hw/xen/xen_backend.h
  +#include sysemu/tpm_backend_int.h
  +#include tpm_tis.h
  +
  +#ifdef DEBUG_TPM
  +#define DPRINTF(fmt, ...) \
  +do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0) #else
  +#define DPRINTF(fmt, ...) \
  +do { } while (0)
  +#endif
  +
  +#define TYPE_TPM_XENSTUBDOMS tpm-xenstubdoms
  +#define TPM_XENSTUBDOMS(obj) \
  +OBJECT_CHECK(TPMXenstubdomsState, (obj),
 TYPE_TPM_XENSTUBDOMS)
  +
  +static const TPMDriverOps tpm_xenstubdoms_driver;
  +
  +/* data structures */
  +typedef struct TPMXenstubdomsThreadParams {
  +TPMState *tpm_state;
  +TPMRecvDataCB *recv_data_callback;
  +TPMBackend *tb;
  +} TPMXenstubdomsThreadParams;
  +
  +struct TPMXenstubdomsState {
  +TPMBackend parent;
  +TPMBackendThread tbt;
  +TPMXenstubdomsThreadParams tpm_thread_params;
  +bool had_startup_error;
  +};
  +
  +typedef struct TPMXenstubdomsState TPMXenstubdomsState;
  +
  +/* functions */
  +
  +static void tpm_xenstubdoms_cancel_cmd(TPMBackend *tb);
  +
  +static int tpm_xenstubdoms_unix_transfer(const TPMLocality
  +*locty_data) {
  +size_t rlen;
  +struct XenDevice *xendev;
  +
  +xendev = xen_be_find_xendev(vtpm, xen_domid, 0);
  +if (xendev == NULL) {
  +xen_be_printf(xendev, 0, Con not find vtpm device\n);
  +return -1;
  +}
  +vtpm_send(xendev, locty_data-w_buffer.buffer,
 locty_data-w_offset);
  +vtpm_recv(xendev, locty_data-r_buffer.buffer, rlen);
  +return 0;
  +}
  +
  +static void tpm_xenstubdoms_worker_thread(gpointer data,
  +  gpointer

[Qemu-devel] [PULL 00/16] Block patches

2015-01-16 Thread Stefan Hajnoczi

The following changes since commit df58887b20fab8fe8a6dcca4db30cd4e4077d53a:

  Merge remote-tracking branch 
'remotes/mjt/tags/pull-trivial-patches-2015-01-15' into staging (2015-01-15 
10:08:46 +)

are available in the git repository at:

  git://github.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to d6e2630f17fa40342af80d20c04f279fa23ea189:

  qemu-iotests: Fix supported_oses check (2015-01-16 15:36:34 +)





Fam Zheng (1):
  qemu-iotests: Fix supported_oses check

Francesco Romani (1):
  block: add event when disk usage exceeds threshold

Peter Wu (12):
  block/dmg: properly detect the UDIF trailer
  block/dmg: extract mish block decoding functionality
  block/dmg: extract processing of resource forks
  block/dmg: process a buffer instead of reading ints
  block/dmg: validate chunk size to avoid overflow
  block/dmg: process XML plists
  block/dmg: set virtual size to a non-zero value
  block/dmg: fix sector data offset calculation
  block/dmg: use SectorNumber from BLKX header
  block/dmg: factor out block type check
  block/dmg: support bzip2 block entry types
  block/dmg: improve zeroes handling

Stefan Hajnoczi (2):
  qed: check for header size overflow
  qemu-iotests: add 116 invalid QED input file tests

 block/Makefile.objs |   2 +
 block/dmg.c | 501 ++--
 block/qapi.c|   3 +
 block/qed.c |   5 +
 block/write-threshold.c | 125 ++
 configure   |  31 +++
 include/block/block_int.h   |   4 +
 include/block/write-threshold.h |  64 +
 qapi/block-core.json|  51 +++-
 qmp-commands.hx |  32 +++
 tests/Makefile  |   3 +
 tests/qemu-iotests/067.out  |   5 +
 tests/qemu-iotests/116  |  96 
 tests/qemu-iotests/116.out  |  37 +++
 tests/qemu-iotests/group|   1 +
 tests/qemu-iotests/iotests.py   |   2 +-
 tests/test-write-threshold.c| 119 ++
 17 files changed, 962 insertions(+), 119 deletions(-)
 create mode 100644 block/write-threshold.c
 create mode 100644 include/block/write-threshold.h
 create mode 100755 tests/qemu-iotests/116
 create mode 100644 tests/qemu-iotests/116.out
 create mode 100644 tests/test-write-threshold.c

-- 
2.1.0

[Qemu-devel] [PULL 02/16] block/dmg: properly detect the UDIF trailer

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

DMG files have a variable length with a UDIF trailer at the end of a
file. This UDIF trailer is essential as it describes the contents of
the image. At the moment however, the start of this trailer is almost
always incorrect as bdrv_getlength() returns a multiple of the block
size (rounded up). This results in a failure to recognize DMG files,
resulting in Invalid argument (EINVAL) errors.

As there is no API to retrieve the real file size, look for the magic
header in the last two sectors to find the start of this 512-byte UDIF
trailer (the koly block).

The resource fork offset (info_begin) has its offset adjusted as the
initial value of offset does not mean end of file anymore, but begin
of UDIF trailer.

[Replaced error_set(errp, ERROR_CLASS_GENERIC_ERROR, ...) with
error_setg(errp, ...) as discussed with Peter.
--Stefan]

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Reviewed-by: Stefan Hajnoczi stefa...@redhat.com
Message-id: 1420566495-13284-2-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 47 +++
 1 file changed, 43 insertions(+), 4 deletions(-)

diff --git a/block/dmg.c b/block/dmg.c
index e455886..cdad28f 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -131,6 +131,46 @@ static void update_max_chunk_size(BDRVDMGState *s, 
uint32_t chunk,
 }
 }
 
+static int64_t dmg_find_koly_offset(BlockDriverState *file_bs, Error **errp)
+{
+int64_t length;
+int64_t offset = 0;
+uint8_t buffer[515];
+int i, ret;
+
+/* bdrv_getlength returns a multiple of block size (512), rounded up. Since
+ * dmg images can have odd sizes, try to look for the koly magic which
+ * marks the begin of the UDIF trailer (512 bytes). This magic can be found
+ * in the last 511 bytes of the second-last sector or the first 4 bytes of
+ * the last sector (search space: 515 bytes) */
+length = bdrv_getlength(file_bs);
+if (length  0) {
+error_setg_errno(errp, -length,
+Failed to get file size while reading UDIF trailer);
+return length;
+} else if (length  512) {
+error_setg(errp, dmg file must be at least 512 bytes long);
+return -EINVAL;
+}
+if (length  511 + 512) {
+offset = length - 511 - 512;
+}
+length = length  515 ? length : 515;
+ret = bdrv_pread(file_bs, offset, buffer, length);
+if (ret  0) {
+error_setg_errno(errp, -ret, Failed while reading UDIF trailer);
+return ret;
+}
+for (i = 0; i  length - 3; i++) {
+if (buffer[i] == 'k'  buffer[i+1] == 'o' 
+buffer[i+2] == 'l'  buffer[i+3] == 'y') {
+return offset + i;
+}
+}
+error_setg(errp, Could not locate UDIF trailer in dmg file);
+return -EINVAL;
+}
+
 static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
@@ -145,15 +185,14 @@ static int dmg_open(BlockDriverState *bs, QDict *options, 
int flags,
 s-n_chunks = 0;
 s-offsets = s-lengths = s-sectors = s-sectorcounts = NULL;
 
-/* read offset of info blocks */
-offset = bdrv_getlength(bs-file);
+/* locate the UDIF trailer */
+offset = dmg_find_koly_offset(bs-file, errp);
 if (offset  0) {
 ret = offset;
 goto fail;
 }
-offset -= 0x1d8;
 
-ret = read_uint64(bs, offset, info_begin);
+ret = read_uint64(bs, offset + 0x28, info_begin);
 if (ret  0) {
 goto fail;
 } else if (info_begin == 0) {
-- 
2.1.0

[Qemu-devel] [PULL 01/16] block: add event when disk usage exceeds threshold

2015-01-16 Thread Stefan Hajnoczi

From: Francesco Romani from...@redhat.com

Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device,  and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command `block-set-write-threshold', to set a mark for
  a given block device.
* A new event `BLOCK_WRITE_THRESHOLD', to report if a block device
  usage exceeds the threshold.
* A new `write_threshold' field into the `BlockDeviceInfo' structure,
  to report the configured threshold.

This will allow the managing application to use smarter and more
efficient monitoring, greatly reducing the need of polling.

[Updated qemu-iotests 067 output to add the new 'write_threshold'
property.
--Stefan]

Signed-off-by: Francesco Romani from...@redhat.com
Reviewed-by: Eric Blake ebl...@redhat.com
Message-id: 1421068273-692-1-git-send-email-from...@redhat.com
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/Makefile.objs |   1 +
 block/qapi.c|   3 +
 block/write-threshold.c | 125 
 include/block/block_int.h   |   4 ++
 include/block/write-threshold.h |  64 
 qapi/block-core.json|  51 +++-
 qmp-commands.hx |  32 ++
 tests/Makefile  |   3 +
 tests/qemu-iotests/067.out  |   5 ++
 tests/test-write-threshold.c| 119 ++
 10 files changed, 406 insertions(+), 1 deletion(-)
 create mode 100644 block/write-threshold.c
 create mode 100644 include/block/write-threshold.h
 create mode 100644 tests/test-write-threshold.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..010afad 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += write-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index a6fd6f7..03abaab 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 
 #include block/qapi.h
 #include block/block_int.h
+#include block/write-threshold.h
 #include qmp-commands.h
 #include qapi-visit.h
 #include qapi/qmp-output-visitor.h
@@ -89,6 +90,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
 info-iops_size = cfg.op_size;
 }
 
+info-write_threshold = bdrv_write_threshold_get(bs);
+
 return info;
 }
 
diff --git a/block/write-threshold.c b/block/write-threshold.c
new file mode 100644
index 000..c2cd517
--- /dev/null
+++ b/block/write-threshold.c
@@ -0,0 +1,125 @@
+/*
+ * QEMU System Emulator block write threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani from...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include block/block_int.h
+#include block/coroutine.h
+#include block/write-threshold.h
+#include qemu/notify.h
+#include qapi-event.h
+#include qmp-commands.h
+
+
+uint64_t bdrv_write_threshold_get(const BlockDriverState *bs)
+{
+return bs-write_threshold_offset;
+}
+
+bool bdrv_write_threshold_is_set(const BlockDriverState *bs)
+{
+return bs-write_threshold_offset  0;
+}
+
+static void write_threshold_disable(BlockDriverState *bs)
+{
+if (bdrv_write_threshold_is_set(bs)) {
+notifier_with_return_remove(bs-write_threshold_notifier);
+bs-write_threshold_offset = 0;
+}
+}
+
+uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs,
+   const BdrvTrackedRequest *req)
+{
+if (bdrv_write_threshold_is_set(bs)) {
+if (req-offset  bs-write_threshold_offset) {
+return (req-offset - bs-write_threshold_offset) + req-bytes;
+}
+if ((req-offset + req-bytes)  bs-write_threshold_offset) {
+return (req-offset + req-bytes) - bs-write_threshold_offset;
+}
+}
+return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+void *opaque)
+{
+BdrvTrackedRequest *req = opaque;
+BlockDriverState *bs = req-bs;
+uint64_t amount = 0;
+
+amount = bdrv_write_threshold_exceeded(bs, req);
+if (amount  0) {
+

[Qemu-devel] [PULL 03/16] block/dmg: extract mish block decoding functionality

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

Extract the mish block decoder such that this can be used for other
formats in the future. A new DmgHeaderState struct is introduced to
share state while decoding.

The code is kept unchanged as much as possible, a fail label is added
for example where a simple return would probably do. In dmg_open, the
variable tmp is renamed to rsrc_data_offset for clarity and comments
have been added explaining various data.

Note that this patch has one subtle difference with the previous
version which should not affect functionality. In the previous code,
the end of a resource was inferred from the mish block (the offsets
would be increased by the fields). In this patch, the resource length
is used instead to avoid the need to rely on the previous offsets.

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Reviewed-by: Stefan Hajnoczi stefa...@redhat.com
Message-id: 1420566495-13284-3-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 228 +++-
 1 file changed, 133 insertions(+), 95 deletions(-)

diff --git a/block/dmg.c b/block/dmg.c
index cdad28f..c571ac9 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -171,19 +171,138 @@ static int64_t dmg_find_koly_offset(BlockDriverState 
*file_bs, Error **errp)
 return -EINVAL;
 }
 
+/* used when building the sector table */
+typedef struct DmgHeaderState {
+/* used internally by dmg_read_mish_block to remember offsets of blocks
+ * across calls */
+uint64_t last_in_offset;
+uint64_t last_out_offset;
+/* exported for dmg_open */
+uint32_t max_compressed_size;
+uint32_t max_sectors_per_chunk;
+} DmgHeaderState;
+
+static int dmg_read_mish_block(BlockDriverState *bs, DmgHeaderState *ds,
+   int64_t offset, uint32_t count)
+{
+BDRVDMGState *s = bs-opaque;
+uint32_t type, i;
+int ret;
+size_t new_size;
+uint32_t chunk_count;
+
+ret = read_uint32(bs, offset, type);
+if (ret  0) {
+goto fail;
+}
+
+/* skip data that is not a valid MISH block (invalid magic or too small) */
+if (type != 0x6d697368 || count  244) {
+/* assume success for now */
+return 0;
+}
+
+offset += 4;
+offset += 200;
+
+chunk_count = (count - 204) / 40;
+new_size = sizeof(uint64_t) * (s-n_chunks + chunk_count);
+s-types = g_realloc(s-types, new_size / 2);
+s-offsets = g_realloc(s-offsets, new_size);
+s-lengths = g_realloc(s-lengths, new_size);
+s-sectors = g_realloc(s-sectors, new_size);
+s-sectorcounts = g_realloc(s-sectorcounts, new_size);
+
+for (i = s-n_chunks; i  s-n_chunks + chunk_count; i++) {
+ret = read_uint32(bs, offset, s-types[i]);
+if (ret  0) {
+goto fail;
+}
+offset += 4;
+if (s-types[i] != 0x8005  s-types[i] != 1 
+s-types[i] != 2) {
+if (s-types[i] == 0x  i  0) {
+ds-last_in_offset = s-offsets[i - 1] + s-lengths[i - 1];
+ds-last_out_offset = s-sectors[i - 1] +
+  s-sectorcounts[i - 1];
+}
+chunk_count--;
+i--;
+offset += 36;
+continue;
+}
+offset += 4;
+
+ret = read_uint64(bs, offset, s-sectors[i]);
+if (ret  0) {
+goto fail;
+}
+s-sectors[i] += ds-last_out_offset;
+offset += 8;
+
+ret = read_uint64(bs, offset, s-sectorcounts[i]);
+if (ret  0) {
+goto fail;
+}
+offset += 8;
+
+if (s-sectorcounts[i]  DMG_SECTORCOUNTS_MAX) {
+error_report(sector count % PRIu64  for chunk % PRIu32
+  is larger than max (%u),
+ s-sectorcounts[i], i, DMG_SECTORCOUNTS_MAX);
+ret = -EINVAL;
+goto fail;
+}
+
+ret = read_uint64(bs, offset, s-offsets[i]);
+if (ret  0) {
+goto fail;
+}
+s-offsets[i] += ds-last_in_offset;
+offset += 8;
+
+ret = read_uint64(bs, offset, s-lengths[i]);
+if (ret  0) {
+goto fail;
+}
+offset += 8;
+
+if (s-lengths[i]  DMG_LENGTHS_MAX) {
+error_report(length % PRIu64  for chunk % PRIu32
+  is larger than max (%u),
+ s-lengths[i], i, DMG_LENGTHS_MAX);
+ret = -EINVAL;
+goto fail;
+}
+
+update_max_chunk_size(s, i, ds-max_compressed_size,
+  ds-max_sectors_per_chunk);
+}
+s-n_chunks += chunk_count;
+return 0;
+
+fail:
+return ret;
+}
+
 static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
 BDRVDMGState *s = bs-opaque;
-uint64_t info_begin,

[Qemu-devel] [PULL 14/16] qed: check for header size overflow

2015-01-16 Thread Stefan Hajnoczi

Header size is denoted in clusters.  The maximum cluster size is 64 MB
but there is no limit on header size.  Check for uint32_t overflow in
case the header size field has a whacky value.

Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
Message-id: 1421065893-18875-2-git-send-email-stefa...@redhat.com
Reviewed-by: Kevin Wolf kw...@redhat.com
---
 block/qed.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/block/qed.c b/block/qed.c
index 80f18d8..892b13c 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -440,6 +440,11 @@ static int bdrv_qed_open(BlockDriverState *bs, QDict 
*options, int flags,
 s-l2_mask = s-table_nelems - 1;
 s-l1_shift = s-l2_shift + ffs(s-table_nelems) - 1;
 
+/* Header size calculation must not overflow uint32_t */
+if (s-header.header_size  UINT32_MAX / s-header.cluster_size) {
+return -EINVAL;
+}
+
 if ((s-header.features  QED_F_BACKING_FILE)) {
 if ((uint64_t)s-header.backing_filename_offset +
 s-header.backing_filename_size 
-- 
2.1.0

[Qemu-devel] [PULL 06/16] block/dmg: validate chunk size to avoid overflow

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

Previously the chunk size was not checked, allowing for a large memory
allocation. This patch checks whether the chunks size is within the
resource fork length, and whether the resource fork is below the
trailer of the dmg file.

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Message-id: 1420566495-13284-6-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/dmg.c b/block/dmg.c
index 4f56227..5c2c2c2 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -317,7 +317,7 @@ static int dmg_read_resource_fork(BlockDriverState *bs, 
DmgHeaderState *ds,
 ret = read_uint32(bs, offset, count);
 if (ret  0) {
 goto fail;
-} else if (count == 0) {
+} else if (count == 0 || count  info_end - offset) {
 ret = -EINVAL;
 goto fail;
 }
@@ -377,6 +377,11 @@ static int dmg_open(BlockDriverState *bs, QDict *options, 
int flags,
 if (ret  0) {
 goto fail;
 }
+if (rsrc_fork_offset = offset ||
+rsrc_fork_length  offset - rsrc_fork_offset) {
+ret = -EINVAL;
+goto fail;
+}
 if (rsrc_fork_length != 0) {
 ret = dmg_read_resource_fork(bs, ds,
  rsrc_fork_offset, rsrc_fork_length);
-- 
2.1.0

[Qemu-devel] [PULL 04/16] block/dmg: extract processing of resource forks

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

Besides the offset, also read the resource length. This length is now
used in the extracted function to verify the end of the resource fork
against count from the resource fork.

Instead of relying on the value of offset to conclude whether the
resource fork is available or not (info_begin==0), check the
rsrc_fork_length instead. This would allow a dmg file to begin with a
resource fork. This seemingly unnecessary restriction was found while
trying to craft a DMG file by hand.

Other changes:

 - Do not require resource data offset to be 0x100 (but check that it
   is within bounds though).
 - Further improve boundary checking (resource data must be within
   the resource fork).
 - Use correct value for resource data length (spotted by John Snow)
 - Consider the resource data offset when determining info_end.
   This fixes an EINVAL on the tuxpaint dmg example.

The resource fork format is documented at
https://developer.apple.com/legacy/library/documentation/mac/pdf/MoreMacintoshToolbox.pdf#page=151

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Message-id: 1420566495-13284-4-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 104 ++--
 1 file changed, 66 insertions(+), 38 deletions(-)

diff --git a/block/dmg.c b/block/dmg.c
index c571ac9..04bae72 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -285,60 +285,38 @@ fail:
 return ret;
 }
 
-static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
-Error **errp)
+static int dmg_read_resource_fork(BlockDriverState *bs, DmgHeaderState *ds,
+  uint64_t info_begin, uint64_t info_length)
 {
-BDRVDMGState *s = bs-opaque;
-DmgHeaderState ds;
-uint64_t info_begin, info_end;
-uint32_t count, rsrc_data_offset;
-int64_t offset;
 int ret;
+uint32_t count, rsrc_data_offset;
+uint64_t info_end;
+uint64_t offset;
 
-bs-read_only = 1;
-s-n_chunks = 0;
-s-offsets = s-lengths = s-sectors = s-sectorcounts = NULL;
-/* used by dmg_read_mish_block to keep track of the current I/O position */
-ds.last_in_offset = 0;
-ds.last_out_offset = 0;
-ds.max_compressed_size = 1;
-ds.max_sectors_per_chunk = 1;
-
-/* locate the UDIF trailer */
-offset = dmg_find_koly_offset(bs-file, errp);
-if (offset  0) {
-ret = offset;
-goto fail;
-}
-
-ret = read_uint64(bs, offset + 0x28, info_begin);
-if (ret  0) {
-goto fail;
-} else if (info_begin == 0) {
-ret = -EINVAL;
-goto fail;
-}
-
+/* read offset from begin of resource fork (info_begin) to resource data */
 ret = read_uint32(bs, info_begin, rsrc_data_offset);
 if (ret  0) {
 goto fail;
-} else if (rsrc_data_offset != 0x100) {
+} else if (rsrc_data_offset  info_length) {
 ret = -EINVAL;
 goto fail;
 }
 
-ret = read_uint32(bs, info_begin + 4, count);
+/* read length of resource data */
+ret = read_uint32(bs, info_begin + 8, count);
 if (ret  0) {
 goto fail;
-} else if (count == 0) {
+} else if (count == 0 || rsrc_data_offset + count  info_length) {
 ret = -EINVAL;
 goto fail;
 }
-/* end of resource data, ignoring the following resource map */
-info_end = info_begin + count;
 
 /* begin of resource data (consisting of one or more resources) */
-offset = info_begin + 0x100;
+offset = info_begin + rsrc_data_offset;
+
+/* end of resource data (there is possibly a following resource map
+ * which will be ignored). */
+info_end = offset + count;
 
 /* read offsets (mish blocks) from one or more resources in resource data 
*/
 while (offset  info_end) {
@@ -352,13 +330,63 @@ static int dmg_open(BlockDriverState *bs, QDict *options, 
int flags,
 }
 offset += 4;
 
-ret = dmg_read_mish_block(bs, ds, offset, count);
+ret = dmg_read_mish_block(bs, ds, offset, count);
 if (ret  0) {
 goto fail;
 }
 /* advance offset by size of resource */
 offset += count;
 }
+return 0;
+
+fail:
+return ret;
+}
+
+static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
+Error **errp)
+{
+BDRVDMGState *s = bs-opaque;
+DmgHeaderState ds;
+uint64_t rsrc_fork_offset, rsrc_fork_length;
+int64_t offset;
+int ret;
+
+bs-read_only = 1;
+s-n_chunks = 0;
+s-offsets = s-lengths = s-sectors = s-sectorcounts = NULL;
+/* used by dmg_read_mish_block to keep track of the current I/O position */
+ds.last_in_offset = 0;
+ds.last_out_offset = 0;
+ds.max_compressed_size = 1;
+ds.max_sectors_per_chunk = 1;
+
+/* locate the UDIF trailer */
+offset = dmg_find_koly_offset(bs-file, errp);
+if (offset

[Qemu-devel] [PULL 12/16] block/dmg: support bzip2 block entry types

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

This patch adds support for bzip2-compressed block entries as introduced
with OS X 10.4 (source: https://en.wikipedia.org/wiki/Apple_Disk_Image).

It was tested against a 5.2G OS X Yosemite installation image which
stores the BLXX block in the XML property list (instead of resource
forks) and has over 5k chunks.

New configure entries are added (--enable-bzip2 / --disable-bzip2) to
control inclusion of bzip2 functionality (which requires linking against
libbz2). The help message suggests that this option is needed for DMG
files, but the tests are generic enough that other parts of QEMU can use
bzip2 if needed.

The identifiers are based on http://newosxbook.com/DMG.html.

The decompression routines are based on the zlib case, but as there is
no way to reset the decompression state (unlike zlib), memory is
allocated and deallocated for every decompression. This should not be
problematic as the decompression takes most of the time and as blocks
are typically about/over 1 MiB in size, only one allocation is done
every 2000 sectors.

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Acked-by: Paolo Bonzini pbonz...@redhat.com
Message-id: 1420566495-13284-12-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/Makefile.objs |  1 +
 block/dmg.c | 43 ++-
 configure   | 31 +++
 3 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 010afad..db2933e 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -37,5 +37,6 @@ gluster.o-libs := $(GLUSTERFS_LIBS)
 ssh.o-cflags   := $(LIBSSH2_CFLAGS)
 ssh.o-libs := $(LIBSSH2_LIBS)
 archipelago.o-libs := $(ARCHIPELAGO_LIBS)
+dmg.o-libs := $(BZIP2_LIBS)
 qcow.o-libs:= -lz
 linux-aio.o-libs   := -laio
diff --git a/block/dmg.c b/block/dmg.c
index 2bb7078..196c746 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -26,6 +26,9 @@
 #include qemu/bswap.h
 #include qemu/module.h
 #include zlib.h
+#ifdef CONFIG_BZIP2
+#include bzlib.h
+#endif
 #include glib.h
 
 enum {
@@ -56,6 +59,9 @@ typedef struct BDRVDMGState {
 uint8_t *compressed_chunk;
 uint8_t *uncompressed_chunk;
 z_stream zstream;
+#ifdef CONFIG_BZIP2
+bz_stream bzstream;
+#endif
 } BDRVDMGState;
 
 static int dmg_probe(const uint8_t *buf, int buf_size, const char *filename)
@@ -123,6 +129,7 @@ static void update_max_chunk_size(BDRVDMGState *s, uint32_t 
chunk,
 
 switch (s-types[chunk]) {
 case 0x8005: /* zlib compressed */
+case 0x8006: /* bzip2 compressed */
 compressed_size = s-lengths[chunk];
 uncompressed_sectors = s-sectorcounts[chunk];
 break;
@@ -198,6 +205,9 @@ static bool dmg_is_known_block_type(uint32_t entry_type)
 case 0x0001:/* uncompressed */
 case 0x0002:/* zeroes */
 case 0x8005:/* zlib */
+#ifdef CONFIG_BZIP2
+case 0x8006:/* bzip2 */
+#endif
 return true;
 default:
 return false;
@@ -563,13 +573,16 @@ static inline int dmg_read_chunk(BlockDriverState *bs, 
uint64_t sector_num)
 if (!is_sector_in_chunk(s, s-current_chunk, sector_num)) {
 int ret;
 uint32_t chunk = search_chunk(s, sector_num);
+#ifdef CONFIG_BZIP2
+uint64_t total_out;
+#endif
 
 if (chunk = s-n_chunks) {
 return -1;
 }
 
 s-current_chunk = s-n_chunks;
-switch (s-types[chunk]) {
+switch (s-types[chunk]) { /* block entry type */
 case 0x8005: { /* zlib compressed */
 /* we need to buffer, because only the chunk as whole can be
  * inflated. */
@@ -593,6 +606,34 @@ static inline int dmg_read_chunk(BlockDriverState *bs, 
uint64_t sector_num)
 return -1;
 }
 break; }
+#ifdef CONFIG_BZIP2
+case 0x8006: /* bzip2 compressed */
+/* we need to buffer, because only the chunk as whole can be
+ * inflated. */
+ret = bdrv_pread(bs-file, s-offsets[chunk],
+ s-compressed_chunk, s-lengths[chunk]);
+if (ret != s-lengths[chunk]) {
+return -1;
+}
+
+ret = BZ2_bzDecompressInit(s-bzstream, 0, 0);
+if (ret != BZ_OK) {
+return -1;
+}
+s-bzstream.next_in = (char *)s-compressed_chunk;
+s-bzstream.avail_in = (unsigned int) s-lengths[chunk];
+s-bzstream.next_out = (char *)s-uncompressed_chunk;
+s-bzstream.avail_out = (unsigned int) 512 * 
s-sectorcounts[chunk];
+ret = BZ2_bzDecompress(s-bzstream);
+total_out = ((uint64_t)s-bzstream.total_out_hi32  32) +
+s-bzstream.total_out_lo32;
+BZ2_bzDecompressEnd(s-bzstream);
+if (ret

[Qemu-devel] [PULL 07/16] block/dmg: process XML plists

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

The format is simple enough to avoid using a full-blown XML parser. It
assumes that all BLKX items begin with the mish magic word, therefore
it is not a problem if other values get matched which are not a BLKX
block.

The offsets are based on the description at
http://newosxbook.com/DMG.html

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Message-id: 1420566495-13284-7-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 74 +
 1 file changed, 74 insertions(+)

diff --git a/block/dmg.c b/block/dmg.c
index 5c2c2c2..d4a7520 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -26,6 +26,7 @@
 #include qemu/bswap.h
 #include qemu/module.h
 #include zlib.h
+#include glib.h
 
 enum {
 /* Limit chunk sizes to prevent unreasonable amounts of memory being used
@@ -343,12 +344,66 @@ fail:
 return ret;
 }
 
+static int dmg_read_plist_xml(BlockDriverState *bs, DmgHeaderState *ds,
+  uint64_t info_begin, uint64_t info_length)
+{
+BDRVDMGState *s = bs-opaque;
+int ret;
+uint8_t *buffer = NULL;
+char *data_begin, *data_end;
+
+/* Have at least some length to avoid NULL for g_malloc. Attempt to set a
+ * safe upper cap on the data length. A test sample had a XML length of
+ * about 1 MiB. */
+if (info_length == 0 || info_length  16 * 1024 * 1024) {
+ret = -EINVAL;
+goto fail;
+}
+
+buffer = g_malloc(info_length + 1);
+buffer[info_length] = '\0';
+ret = bdrv_pread(bs-file, info_begin, buffer, info_length);
+if (ret != info_length) {
+ret = -EINVAL;
+goto fail;
+}
+
+/* look for data.../data. The data is 284 (0x11c) bytes after base64
+ * decode. The actual data element has 431 (0x1af) bytes which includes 
tabs
+ * and line feeds. */
+data_end = (char *)buffer;
+while ((data_begin = strstr(data_end, data)) != NULL) {
+gsize out_len = 0;
+
+data_begin += 6;
+data_end = strstr(data_begin, /data);
+/* malformed XML? */
+if (data_end == NULL) {
+ret = -EINVAL;
+goto fail;
+}
+*data_end++ = '\0';
+g_base64_decode_inplace(data_begin, out_len);
+ret = dmg_read_mish_block(s, ds, (uint8_t *)data_begin,
+  (uint32_t)out_len);
+if (ret  0) {
+goto fail;
+}
+}
+ret = 0;
+
+fail:
+g_free(buffer);
+return ret;
+}
+
 static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
 BDRVDMGState *s = bs-opaque;
 DmgHeaderState ds;
 uint64_t rsrc_fork_offset, rsrc_fork_length;
+uint64_t plist_xml_offset, plist_xml_length;
 int64_t offset;
 int ret;
 
@@ -382,12 +437,31 @@ static int dmg_open(BlockDriverState *bs, QDict *options, 
int flags,
 ret = -EINVAL;
 goto fail;
 }
+/* offset of property list (XMLOffset) */
+ret = read_uint64(bs, offset + 0xd8, plist_xml_offset);
+if (ret  0) {
+goto fail;
+}
+ret = read_uint64(bs, offset + 0xe0, plist_xml_length);
+if (ret  0) {
+goto fail;
+}
+if (plist_xml_offset = offset ||
+plist_xml_length  offset - plist_xml_offset) {
+ret = -EINVAL;
+goto fail;
+}
 if (rsrc_fork_length != 0) {
 ret = dmg_read_resource_fork(bs, ds,
  rsrc_fork_offset, rsrc_fork_length);
 if (ret  0) {
 goto fail;
 }
+} else if (plist_xml_length != 0) {
+ret = dmg_read_plist_xml(bs, ds, plist_xml_offset, plist_xml_length);
+if (ret  0) {
+goto fail;
+}
 } else {
 ret = -EINVAL;
 goto fail;
-- 
2.1.0

[Qemu-devel] [PULL 08/16] block/dmg: set virtual size to a non-zero value

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

Right now the virtual size is always reported as zero which makes it
impossible to convert between formats.

After this patch, the number of sectors will be read from the trailer
(koly block).

To verify the behavior, the output of `dmg2img foo.dmg foo.img` was
compared against `qemu-img convert -f dmg -O raw foo.dmg foo.raw`. The
tests showed that the file contents are exactly the same, except that
QEMU creates a slightly larger file (it matches the total sectors
count).

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Message-id: 1420566495-13284-8-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/block/dmg.c b/block/dmg.c
index d4a7520..3f8f1cc 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -451,6 +451,14 @@ static int dmg_open(BlockDriverState *bs, QDict *options, 
int flags,
 ret = -EINVAL;
 goto fail;
 }
+ret = read_uint64(bs, offset + 0x1ec, (uint64_t *)bs-total_sectors);
+if (ret  0) {
+goto fail;
+}
+if (bs-total_sectors  0) {
+ret = -EINVAL;
+goto fail;
+}
 if (rsrc_fork_length != 0) {
 ret = dmg_read_resource_fork(bs, ds,
  rsrc_fork_offset, rsrc_fork_length);
-- 
2.1.0

[Qemu-devel] [PULL 11/16] block/dmg: factor out block type check

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

In preparation for adding bzip2 support, split the type check into a
separate function. Make all offsets relative to the begin of a chunk
such that it is easier to recognize the position without having to
add up all offsets. Some comments are added to describe the fields.

There is no functional change.

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Message-id: 1420566495-13284-11-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 36 +++-
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/block/dmg.c b/block/dmg.c
index 0f278c8..2bb7078 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -192,6 +192,18 @@ typedef struct DmgHeaderState {
 uint32_t max_sectors_per_chunk;
 } DmgHeaderState;
 
+static bool dmg_is_known_block_type(uint32_t entry_type)
+{
+switch (entry_type) {
+case 0x0001:/* uncompressed */
+case 0x0002:/* zeroes */
+case 0x8005:/* zlib */
+return true;
+default:
+return false;
+}
+}
+
 static int dmg_read_mish_block(BDRVDMGState *s, DmgHeaderState *ds,
uint8_t *buffer, uint32_t count)
 {
@@ -231,22 +243,19 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 
 for (i = s-n_chunks; i  s-n_chunks + chunk_count; i++) {
 s-types[i] = buff_read_uint32(buffer, offset);
-offset += 4;
-if (s-types[i] != 0x8005  s-types[i] != 1 
-s-types[i] != 2) {
+if (!dmg_is_known_block_type(s-types[i])) {
 chunk_count--;
 i--;
-offset += 36;
+offset += 40;
 continue;
 }
-offset += 4;
 
-s-sectors[i] = buff_read_uint64(buffer, offset);
+/* sector number */
+s-sectors[i] = buff_read_uint64(buffer, offset + 8);
 s-sectors[i] += out_offset;
-offset += 8;
 
-s-sectorcounts[i] = buff_read_uint64(buffer, offset);
-offset += 8;
+/* sector count */
+s-sectorcounts[i] = buff_read_uint64(buffer, offset + 0x10);
 
 if (s-sectorcounts[i]  DMG_SECTORCOUNTS_MAX) {
 error_report(sector count % PRIu64  for chunk % PRIu32
@@ -256,12 +265,12 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 goto fail;
 }
 
-s-offsets[i] = buff_read_uint64(buffer, offset);
+/* offset in (compressed) data fork */
+s-offsets[i] = buff_read_uint64(buffer, offset + 0x18);
 s-offsets[i] += in_offset;
-offset += 8;
 
-s-lengths[i] = buff_read_uint64(buffer, offset);
-offset += 8;
+/* length in (compressed) data fork */
+s-lengths[i] = buff_read_uint64(buffer, offset + 0x20);
 
 if (s-lengths[i]  DMG_LENGTHS_MAX) {
 error_report(length % PRIu64  for chunk % PRIu32
@@ -273,6 +282,7 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 
 update_max_chunk_size(s, i, ds-max_compressed_size,
   ds-max_sectors_per_chunk);
+offset += 40;
 }
 s-n_chunks += chunk_count;
 return 0;
-- 
2.1.0

[Qemu-devel] [PULL 05/16] block/dmg: process a buffer instead of reading ints

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

As the decoded plist XML is not a pointer in the file,
dmg_read_mish_block must be able to process a buffer instead of a file
pointer. Since the full buffer must be processed, let's change the
return value again to just a success flag.

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Message-id: 1420566495-13284-5-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 60 ++--
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/block/dmg.c b/block/dmg.c
index 04bae72..4f56227 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -100,6 +100,16 @@ static int read_uint32(BlockDriverState *bs, int64_t 
offset, uint32_t *result)
 return 0;
 }
 
+static inline uint64_t buff_read_uint64(const uint8_t *buffer, int64_t offset)
+{
+return be64_to_cpu(*(uint64_t *)buffer[offset]);
+}
+
+static inline uint32_t buff_read_uint32(const uint8_t *buffer, int64_t offset)
+{
+return be32_to_cpu(*(uint32_t *)buffer[offset]);
+}
+
 /* Increase max chunk sizes, if necessary.  This function is used to calculate
  * the buffer sizes needed for compressed/uncompressed chunk I/O.
  */
@@ -182,20 +192,16 @@ typedef struct DmgHeaderState {
 uint32_t max_sectors_per_chunk;
 } DmgHeaderState;
 
-static int dmg_read_mish_block(BlockDriverState *bs, DmgHeaderState *ds,
-   int64_t offset, uint32_t count)
+static int dmg_read_mish_block(BDRVDMGState *s, DmgHeaderState *ds,
+   uint8_t *buffer, uint32_t count)
 {
-BDRVDMGState *s = bs-opaque;
 uint32_t type, i;
 int ret;
 size_t new_size;
 uint32_t chunk_count;
+int64_t offset = 0;
 
-ret = read_uint32(bs, offset, type);
-if (ret  0) {
-goto fail;
-}
-
+type = buff_read_uint32(buffer, offset);
 /* skip data that is not a valid MISH block (invalid magic or too small) */
 if (type != 0x6d697368 || count  244) {
 /* assume success for now */
@@ -214,10 +220,7 @@ static int dmg_read_mish_block(BlockDriverState *bs, 
DmgHeaderState *ds,
 s-sectorcounts = g_realloc(s-sectorcounts, new_size);
 
 for (i = s-n_chunks; i  s-n_chunks + chunk_count; i++) {
-ret = read_uint32(bs, offset, s-types[i]);
-if (ret  0) {
-goto fail;
-}
+s-types[i] = buff_read_uint32(buffer, offset);
 offset += 4;
 if (s-types[i] != 0x8005  s-types[i] != 1 
 s-types[i] != 2) {
@@ -233,17 +236,11 @@ static int dmg_read_mish_block(BlockDriverState *bs, 
DmgHeaderState *ds,
 }
 offset += 4;
 
-ret = read_uint64(bs, offset, s-sectors[i]);
-if (ret  0) {
-goto fail;
-}
+s-sectors[i] = buff_read_uint64(buffer, offset);
 s-sectors[i] += ds-last_out_offset;
 offset += 8;
 
-ret = read_uint64(bs, offset, s-sectorcounts[i]);
-if (ret  0) {
-goto fail;
-}
+s-sectorcounts[i] = buff_read_uint64(buffer, offset);
 offset += 8;
 
 if (s-sectorcounts[i]  DMG_SECTORCOUNTS_MAX) {
@@ -254,17 +251,11 @@ static int dmg_read_mish_block(BlockDriverState *bs, 
DmgHeaderState *ds,
 goto fail;
 }
 
-ret = read_uint64(bs, offset, s-offsets[i]);
-if (ret  0) {
-goto fail;
-}
+s-offsets[i] = buff_read_uint64(buffer, offset);
 s-offsets[i] += ds-last_in_offset;
 offset += 8;
 
-ret = read_uint64(bs, offset, s-lengths[i]);
-if (ret  0) {
-goto fail;
-}
+s-lengths[i] = buff_read_uint64(buffer, offset);
 offset += 8;
 
 if (s-lengths[i]  DMG_LENGTHS_MAX) {
@@ -288,8 +279,10 @@ fail:
 static int dmg_read_resource_fork(BlockDriverState *bs, DmgHeaderState *ds,
   uint64_t info_begin, uint64_t info_length)
 {
+BDRVDMGState *s = bs-opaque;
 int ret;
 uint32_t count, rsrc_data_offset;
+uint8_t *buffer = NULL;
 uint64_t info_end;
 uint64_t offset;
 
@@ -330,16 +323,23 @@ static int dmg_read_resource_fork(BlockDriverState *bs, 
DmgHeaderState *ds,
 }
 offset += 4;
 
-ret = dmg_read_mish_block(bs, ds, offset, count);
+buffer = g_realloc(buffer, count);
+ret = bdrv_pread(bs-file, offset, buffer, count);
+if (ret  0) {
+goto fail;
+}
+
+ret = dmg_read_mish_block(s, ds, buffer, count);
 if (ret  0) {
 goto fail;
 }
 /* advance offset by size of resource */
 offset += count;
 }
-return 0;
+ret = 0;
 
 fail:
+g_free(buffer);
 return ret;
 }
 
-- 
2.1.0

[Qemu-devel] [PULL 16/16] qemu-iotests: Fix supported_oses check

2015-01-16 Thread Stefan Hajnoczi

From: Fam Zheng f...@redhat.com

There is a bug in the recently added sys.platform test, and we no longer
run python tests, because linux2 is the value to compare here. So do a
prefix match. According to python doc [1], the way to use sys.platform
is unless you want to test for a specific system version, it is
therefore recommended to use the following idiom:

if sys.platform.startswith('freebsd'):
# FreeBSD-specific code here...
elif sys.platform.startswith('linux'):
# Linux-specific code here...

[1]: https://docs.python.org/2.7/library/sys.html#sys.platform

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 tests/qemu-iotests/iotests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 87002e0..241b5ee 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -288,7 +288,7 @@ def main(supported_fmts=[], supported_oses=['linux']):
 if supported_fmts and (imgfmt not in supported_fmts):
 notrun('not suitable for this image format: %s' % imgfmt)
 
-if sys.platform not in supported_oses:
+if True not in [sys.platform.startswith(x) for x in supported_oses]:
 notrun('not suitable for this OS: %s' % sys.platform)
 
 # We need to filter out the time taken from the output so that qemu-iotest
-- 
2.1.0

[Qemu-devel] [PULL 10/16] block/dmg: use SectorNumber from BLKX header

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

Previously the sector table parsing relied on the previous offset of
the DMG file. Now it uses the sector number from the BLKX header
(see http://newosxbook.com/DMG.html).

The implementation of dmg2img (from vu1tur) does not base the output
sector on the location of the terminator (0x) either so it
should be safe to drop this dependency on the previous state.

(It makes somehow makes sense, a terminator should halt further
processing of a block and is perhaps used to preallocate some space.)

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Message-id: 1420566495-13284-10-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/block/dmg.c b/block/dmg.c
index 65fc7fe..0f278c8 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -187,7 +187,6 @@ typedef struct DmgHeaderState {
 /* used internally by dmg_read_mish_block to remember offsets of blocks
  * across calls */
 uint64_t data_fork_offset;
-uint64_t last_out_offset;
 /* exported for dmg_open */
 uint32_t max_compressed_size;
 uint32_t max_sectors_per_chunk;
@@ -203,6 +202,7 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 int64_t offset = 0;
 uint64_t data_offset;
 uint64_t in_offset = ds-data_fork_offset;
+uint64_t out_offset;
 
 type = buff_read_uint32(buffer, offset);
 /* skip data that is not a valid MISH block (invalid magic or too small) */
@@ -211,6 +211,9 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 return 0;
 }
 
+/* chunk offsets are relative to this sector number */
+out_offset = buff_read_uint64(buffer, offset + 8);
+
 /* location in data fork for (compressed) blob (in bytes) */
 data_offset = buff_read_uint64(buffer, offset + 0x18);
 in_offset += data_offset;
@@ -231,10 +234,6 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 offset += 4;
 if (s-types[i] != 0x8005  s-types[i] != 1 
 s-types[i] != 2) {
-if (s-types[i] == 0x  i  0) {
-ds-last_out_offset = s-sectors[i - 1] +
-  s-sectorcounts[i - 1];
-}
 chunk_count--;
 i--;
 offset += 36;
@@ -243,7 +242,7 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 offset += 4;
 
 s-sectors[i] = buff_read_uint64(buffer, offset);
-s-sectors[i] += ds-last_out_offset;
+s-sectors[i] += out_offset;
 offset += 8;
 
 s-sectorcounts[i] = buff_read_uint64(buffer, offset);
@@ -417,7 +416,6 @@ static int dmg_open(BlockDriverState *bs, QDict *options, 
int flags,
 s-offsets = s-lengths = s-sectors = s-sectorcounts = NULL;
 /* used by dmg_read_mish_block to keep track of the current I/O position */
 ds.data_fork_offset = 0;
-ds.last_out_offset = 0;
 ds.max_compressed_size = 1;
 ds.max_sectors_per_chunk = 1;
 
-- 
2.1.0

Re: [Qemu-devel] [PATCH v11 04/13] block: Introduce bdrv_dirty_bitmap_granularity()

2015-01-16 Thread Max Reitz


On 2015-01-12 at 11:30, John Snow wrote:

From: Fam Zheng f...@redhat.com

This returns the granularity (in bytes) of dirty bitmap,
which matches the QMP interface and the existing query
interface.

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   | 9 +++--
  include/block/block.h | 2 ++
  2 files changed, 9 insertions(+), 2 deletions(-)


Reviewed-by: Max Reitz mre...@redhat.com

[Qemu-devel] [PULL 15/16] qemu-iotests: add 116 invalid QED input file tests

2015-01-16 Thread Stefan Hajnoczi

These tests exercise error code paths in the QED image format.  The
tests are very simple, they just prove that the error path exits
cleanly.

Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
Message-id: 1421065893-18875-3-git-send-email-stefa...@redhat.com
Reviewed-by: Kevin Wolf kw...@redhat.com
---
 tests/qemu-iotests/116 | 96 ++
 tests/qemu-iotests/116.out | 37 ++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 134 insertions(+)
 create mode 100755 tests/qemu-iotests/116
 create mode 100644 tests/qemu-iotests/116.out

diff --git a/tests/qemu-iotests/116 b/tests/qemu-iotests/116
new file mode 100755
index 000..713ed48
--- /dev/null
+++ b/tests/qemu-iotests/116
@@ -0,0 +1,96 @@
+#!/bin/bash
+#
+# Test error code paths for invalid QED images
+#
+# The aim of this test is to exercise the error paths in qed_open() to ensure
+# there are no crashes with invalid input files.
+#
+# Copyright (C) 2015 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see http://www.gnu.org/licenses/.
+#
+
+# creator
+owner=stefa...@redhat.com
+
+seq=`basename $0`
+echo QA output created by $seq
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+
+_cleanup()
+{
+   _cleanup_test_img
+}
+trap _cleanup; exit \$status 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qed
+_supported_proto generic
+_supported_os Linux
+
+
+size=128M
+
+echo
+echo == truncated header cluster ==
+_make_test_img $size
+truncate -s 512 $TEST_IMG
+$QEMU_IO -f $IMGFMT -c read 0 $size $TEST_IMG 21 | _filter_qemu_io | 
_filter_testdir
+
+echo
+echo == invalid header magic ==
+_make_test_img $size
+poke_file $TEST_IMG 0 QEDX
+$QEMU_IO -f $IMGFMT -c read 0 $size $TEST_IMG 21 | _filter_qemu_io | 
_filter_testdir
+
+echo
+echo == invalid cluster size ==
+_make_test_img $size
+poke_file $TEST_IMG 4 \xff\xff\xff\xff
+$QEMU_IO -f $IMGFMT -c read 0 $size $TEST_IMG 21 | _filter_qemu_io | 
_filter_testdir
+
+echo
+echo == invalid table size ==
+_make_test_img $size
+poke_file $TEST_IMG 8 \xff\xff\xff\xff
+$QEMU_IO -f $IMGFMT -c read 0 $size $TEST_IMG 21 | _filter_qemu_io | 
_filter_testdir
+
+echo
+echo == invalid header size ==
+_make_test_img $size
+poke_file $TEST_IMG 12 \xff\xff\xff\xff
+$QEMU_IO -f $IMGFMT -c read 0 $size $TEST_IMG 21 | _filter_qemu_io | 
_filter_testdir
+
+echo
+echo == invalid L1 table offset ==
+_make_test_img $size
+poke_file $TEST_IMG 40 \xff\xff\xff\xff\xff\xff\xff\xff
+$QEMU_IO -f $IMGFMT -c read 0 $size $TEST_IMG 21 | _filter_qemu_io | 
_filter_testdir
+
+echo
+echo == invalid image size ==
+_make_test_img $size
+poke_file $TEST_IMG 48 \xff\xff\xff\xff\xff\xff\xff\xff
+$QEMU_IO -f $IMGFMT -c read 0 $size $TEST_IMG 21 | _filter_qemu_io | 
_filter_testdir
+
+# success, all done
+echo *** done
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/116.out b/tests/qemu-iotests/116.out
new file mode 100644
index 000..b679cee
--- /dev/null
+++ b/tests/qemu-iotests/116.out
@@ -0,0 +1,37 @@
+QA output created by 116
+
+== truncated header cluster ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
+qemu-io: can't open device TEST_DIR/t.qed: Could not open 'TEST_DIR/t.qed': 
Invalid argument
+no file open, try 'help open'
+
+== invalid header magic ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
+qemu-io: can't open device TEST_DIR/t.qed: Image not in QED format
+no file open, try 'help open'
+
+== invalid cluster size ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
+qemu-io: can't open device TEST_DIR/t.qed: Could not open 'TEST_DIR/t.qed': 
Invalid argument
+no file open, try 'help open'
+
+== invalid table size ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
+qemu-io: can't open device TEST_DIR/t.qed: Could not open 'TEST_DIR/t.qed': 
Invalid argument
+no file open, try 'help open'
+
+== invalid header size ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
+qemu-io: can't open device TEST_DIR/t.qed: Could not open 'TEST_DIR/t.qed': 
Invalid argument
+no file open, try 'help open'
+
+== invalid L1 table offset ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
+qemu-io: can't open device TEST_DIR/t.qed: Could not open 'TEST_DIR/t.qed': 
Invalid argument
+no file open, try 'help open'
+
+== invalid image size ==
+Formatting

[Qemu-devel] [PULL 13/16] block/dmg: improve zeroes handling

2015-01-16 Thread Stefan Hajnoczi

From: Peter Wu pe...@lekensteyn.nl

Disk images may contain large all-zeroes gaps (1.66k sectors or 812 MiB
is seen in the real world). These blocks (type 2) do not need to be
extracted into a temporary buffer, there is no need to allocate memory
for these blocks nor to check its length.

(For the test image, the maximum uncompressed size is 1054371 bytes,
probably for a bzip2-compressed block.)

Signed-off-by: Peter Wu pe...@lekensteyn.nl
Reviewed-by: John Snow js...@redhat.com
Message-id: 1420566495-13284-13-git-send-email-pe...@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi stefa...@redhat.com
---
 block/dmg.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/block/dmg.c b/block/dmg.c
index 196c746..66a4b3f 100644
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -137,7 +137,9 @@ static void update_max_chunk_size(BDRVDMGState *s, uint32_t 
chunk,
 uncompressed_sectors = (s-lengths[chunk] + 511) / 512;
 break;
 case 2: /* zero */
-uncompressed_sectors = s-sectorcounts[chunk];
+/* as the all-zeroes block may be large, it is treated specially: the
+ * sector is not copied from a large buffer, a simple memset is used
+ * instead. Therefore uncompressed_sectors does not need to be set. */
 break;
 }
 
@@ -267,7 +269,9 @@ static int dmg_read_mish_block(BDRVDMGState *s, 
DmgHeaderState *ds,
 /* sector count */
 s-sectorcounts[i] = buff_read_uint64(buffer, offset + 0x10);
 
-if (s-sectorcounts[i]  DMG_SECTORCOUNTS_MAX) {
+/* all-zeroes sector (type 2) does not need to be uncompressed and 
can
+ * therefore be unbounded. */
+if (s-types[i] != 2  s-sectorcounts[i]  DMG_SECTORCOUNTS_MAX) {
 error_report(sector count % PRIu64  for chunk % PRIu32
   is larger than max (%u),
  s-sectorcounts[i], i, DMG_SECTORCOUNTS_MAX);
@@ -642,7 +646,8 @@ static inline int dmg_read_chunk(BlockDriverState *bs, 
uint64_t sector_num)
 }
 break;
 case 2: /* zero */
-memset(s-uncompressed_chunk, 0, 512 * s-sectorcounts[chunk]);
+/* see dmg_read, it is treated specially. No buffer needs to be
+ * pre-filled, the zeroes can be set directly. */
 break;
 }
 s-current_chunk = chunk;
@@ -661,6 +666,13 @@ static int dmg_read(BlockDriverState *bs, int64_t 
sector_num,
 if (dmg_read_chunk(bs, sector_num + i) != 0) {
 return -1;
 }
+/* Special case: current chunk is all zeroes. Do not perform a memcpy 
as
+ * s-uncompressed_chunk may be too small to cover the large all-zeroes
+ * section. dmg_read_chunk is called to find s-current_chunk */
+if (s-types[s-current_chunk] == 2) { /* all zeroes block entry */
+memset(buf + i * 512, 0, 512);
+continue;
+}
 sector_offset_in_chunk = sector_num + i - s-sectors[s-current_chunk];
 memcpy(buf + i * 512,
s-uncompressed_chunk + sector_offset_in_chunk * 512, 512);
-- 
2.1.0

Re: [Qemu-devel] [PATCH v11 05/13] block: Add bdrv_clear_dirty_bitmap

2015-01-16 Thread Max Reitz


On 2015-01-12 at 11:30, John Snow wrote:

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   | 24 
  include/block/block.h |  1 +
  2 files changed, 21 insertions(+), 4 deletions(-)


Reviewed-by: Max Reitz mre...@redhat.com

Re: [Qemu-devel] [PATCH v11 06/13] hbitmap: add hbitmap_merge

2015-01-16 Thread Max Reitz


On 2015-01-12 at 11:30, John Snow wrote:

We add a bitmap merge operation to assist in error cases
where we wish to combine two bitmaps together.

This is algorithmically O(bits) provided HBITMAP_LEVELS remains
constant. For a full bitmap on a 64bit machine:
sum(bits/64^k, k, 0, HBITMAP_LEVELS) ~= 1.01587 * bits

We may be able to improve running speed for particularly sparse
bitmaps by using iterators, but the running time for dense maps
will be worse.

We present the simpler solution first, and we can refine it later
if needed.

Signed-off-by: John Snow js...@redhat.com
---
  include/qemu/hbitmap.h | 11 +++
  util/hbitmap.c | 28 
  2 files changed, 39 insertions(+)


Reviewed-by: Max Reitz mre...@redhat.com

Re: [Qemu-devel] [PATCH 2/2] tcg-arm: more instruction execution control

2015-01-16 Thread Peter Maydell

On 13 January 2015 at 15:48, Andrew Jones drjo...@redhat.com wrote:
 Cleanup XN/PXN handling in get_phys_addr_lpae, and implement all but
 EL2 support of the following ARMv8 sections

   D4.5.1 Memory access control: Access permissions for instruction
  execution
   G4.7.2 Execute-never restrictions on instruction fetching

 G4.7.2 matches the ARMv7 section B3.7.2 when long-descriptors are used.

 Signed-off-by: Andrew Jones drjo...@redhat.com

 ---
 v3:
   - correct logic for (is_user  !(ap  1)) case
   - base on user doesn't need R/W access to exec patch
 v2: correct assert in el2 case

 I didn't test this with secure mode (I didn't even check if that's
 currently possible), but I did test all other EL10 XN controls with
 both tcg-arm (cortex-a15) and tcg-aarch64 (cortex-a57) by hacking up
 some tests with kvm-unit-tests/arm[64]. I also booted Linux (just
 up to looking for a rootfs) with both.

  target-arm/helper.c | 80 
 +++--
  1 file changed, 66 insertions(+), 14 deletions(-)

 diff --git a/target-arm/helper.c b/target-arm/helper.c
 index 7c30a2669a0f2..715e0f47bec7d 100644
 --- a/target-arm/helper.c
 +++ b/target-arm/helper.c
 @@ -4516,6 +4516,69 @@ static inline int check_ap(CPUARMState *env, int ap, 
 int domain_prot,
}
  }

 +/* Check section/page execution permission.
 + *
 + * Returns PAGE_EXEC if execution is permitted, otherwise zero.
 + *
 + * Supports AArch64, AArch32, and v7-LPAE. Doesn't yet support
 + * EL2. Actually, for v7-LPAE, we'd also need to emulate Virt
 + * Extensions to truly have WXN/UWXN. We don't support checking
 + * for that yet, so we just assume we have them.

If you want to know if the Virtualization Extensions are
present, check feature bit ARM_FEATURE_EL2.

In general this code seems to be rather short on feature
bit checks -- notice that the code it is replacing was
doing checks on the V8 feature bit, for instance.

 + *
 + * @env : CPUARMState
 + * @is_user : 0 for privileged access, 1 for user
 + * @ap  : Access permissions (AP[2:1]) from descriptor
 + * @ns  : (NSTable || NS) from descriptors
 + * @xn  : ([U]XNTable || [U]XN) from descriptors
 + * @pxn : (PXNTable || PXN) from descriptors
 + */
 +static int check_xn_lpae(CPUARMState *env, int is_user, int ap,
 + int ns, int xn, int pxn)
 +{
 +int wxn;
 +
 +switch (arm_current_el(env)) {
 +case 0:

If arm_current_el() is 0 but EL3 is 32 bit then the
controlling environment here is the Secure PL1, which
is EL3, and falling through to look at sctlr_el[1] is
wrong. (SCTLR(S) is sctlr_el[3].)

(Also get_phys_addr() can be called via the ats_write()
functions, which means that arm_current_el() and the
address translation regime we're trying to determine the
answer for aren't necessarily the same...)

I think we're very soon going to need to bite the bullet and
make this code have a concept of the current translation
regime, as the ARM ARM terms it...

If you wish to avoid that (and I wouldn't object, given
this patchset is snowballing in scope already) then do what
get_phys_addr() does to get the relevant SCTLR:
uint32_t sctlr = A32_BANKED_CURRENT_REG_GET(env, sctlr);
This will work for everything we currently support:
 * AArch64 EL1 (with an EL0 either AArch32 or AArch64)
 * AArch32 EL1
 * TZ for the limited case of EL3 == AArch32 (and so no EL1)

 +case 1:
 +wxn = env-cp15.sctlr_el[1]  SCTLR_WXN;
 +if (arm_el_is_aa64(env, 1)) {
 +if (is_user  !(ap  1)) {
 +wxn = 0;
 +}

I don't understand what we're doing here: as far as I can see
the ARM ARM simply defines that WXN is the bit from the
relevant SCTLR, and it always applies if the page is writable.
I have a feeling you're re-calculating half of is this
page writable here, and then doing the other half below.
It would be cleaner to pass in the 'prot' bits that have
already been calculated.

 +pxn = pxn || ((ap  1)  !(ap  2));

Very weird way to write ap == 1 ?

 +xn = is_user ? xn : pxn;
 +} else {
 +int uwxn = env-cp15.sctlr_el[1]  SCTLR_UWXN;
 +pxn = pxn || (uwxn  (ap  1)  !(ap  2));
 +xn = xn || (!is_user  pxn) || (is_user  !(ap  1));
 +}
 +if (arm_is_secure(env)) {
 +xn = xn || (ns  (env-cp15.scr_el3  SCR_SIF));
 +}
 +break;
 +case 2:
 +/* TODO actually support EL2 */
 +assert(false);
 +
 +if (arm_el_is_aa64(env, 2)) {
 +wxn = env-cp15.sctlr_el[2]  SCTLR_WXN;
 +} else {
 +/* wxn = HSCTLR.WXN */
 +}

You can just write this as
wxn = env-cp15.sctlr_el[2]  SCTLR_WXN;

because SCTLR_EL2 and HSCTLR are architecturally mapped and so
we will store the state in the same env-cp15 field whether
EL2 is 32 or 64 bit.

 +break;
 +case 3:
 +if (arm_el_is_aa64(env, 3)) {
 +

Re: [Qemu-devel] [PATCH v11 07/13] qmp: Add block-dirty-bitmap-enable and block-dirty-bitmap-disable

2015-01-16 Thread Max Reitz


On 2015-01-12 at 11:30, John Snow wrote:

From: Fam Zheng f...@redhat.com

This allows to put the dirty bitmap into a disabled state where no more
writes will be tracked.

It will be used before backup or writing to persistent file.

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   | 24 +++-
  blockdev.c| 40 
  include/block/block.h |  3 +++
  qapi/block-core.json  | 28 
  qmp-commands.hx   | 10 ++
  5 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 7bf7079..bd4b449 100644
--- a/block.c
+++ b/block.c
@@ -56,6 +56,7 @@ struct BdrvDirtyBitmap {
  int64_t size;
  int64_t granularity;
  char *name;
+bool enabled;
  QLIST_ENTRY(BdrvDirtyBitmap) list;
  };
  
@@ -5373,10 +5374,16 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs,

  bitmap-granularity = granularity;
  bitmap-bitmap = hbitmap_alloc(bitmap-size, ffs(sector_granularity) - 1);
  bitmap-name = g_strdup(name);
+bitmap-enabled = true;
  QLIST_INSERT_HEAD(bs-dirty_bitmaps, bitmap, list);
  return bitmap;
  }
  
+bool bdrv_dirty_bitmap_enabled(BdrvDirtyBitmap *bitmap)

+{
+return bitmap-enabled;
+}
+
  void bdrv_release_dirty_bitmap(BlockDriverState *bs, BdrvDirtyBitmap *bitmap)
  {
  BdrvDirtyBitmap *bm, *next;
@@ -5391,6 +5398,16 @@ void bdrv_release_dirty_bitmap(BlockDriverState *bs, 
BdrvDirtyBitmap *bitmap)
  }
  }
  
+void bdrv_disable_dirty_bitmap(BdrvDirtyBitmap *bitmap)

+{
+bitmap-enabled = false;
+}
+
+void bdrv_enable_dirty_bitmap(BdrvDirtyBitmap *bitmap)
+{
+bitmap-enabled = true;
+}
+
  BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs)
  {
  BdrvDirtyBitmap *bm;
@@ -5458,7 +5475,9 @@ void bdrv_dirty_iter_init(BlockDriverState *bs,
  void bdrv_set_dirty_bitmap(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
 int64_t cur_sector, int nr_sectors)
  {
-hbitmap_set(bitmap-bitmap, cur_sector, nr_sectors);
+if (bdrv_dirty_bitmap_enabled(bitmap)) {
+hbitmap_set(bitmap-bitmap, cur_sector, nr_sectors);
+}


Why are you checking whether the bitmap is enabled here in 
bdrv_set_dirty_bitmap(), but neither in bdrv_reset_dirty_bitmap() nor 
bdrv_clear_dirty_bitmap()?


It seems consistent with the commit message which only states that 
disabled state means that no more writes (i.e. bdrv_set_dirty_bitmap()) 
will be tracked, but it still seems strange to me.



  }
  
  void bdrv_reset_dirty_bitmap(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,

@@ -5481,6 +5500,9 @@ static void bdrv_set_dirty(BlockDriverState *bs, int64_t 
cur_sector,
  {
  BdrvDirtyBitmap *bitmap;
  QLIST_FOREACH(bitmap, bs-dirty_bitmaps, list) {
+if (!bdrv_dirty_bitmap_enabled(bitmap)) {
+continue;
+}


Same question as above, only this time it's about bdrv_reset_dirty().

In case the answer to the question is that's intentional:

Reviewed-by: Max Reitz mre...@redhat.com

Re: [Qemu-devel] Press Inquiry: Qemu Advent Calendar (German Linux Magazin)

2015-01-16 Thread Matthew Hungerford

Sounds awesome, will have to order some copies for our office.

Sorry about the earlier email not being reply-all, was on mobile email and
sent it incorrectly.

Matthew Hungerford
Pebble Firmware Engineer

On Thu, Jan 15, 2015 at 1:18 PM, Tim Schürmann i...@tim-schuermann.de
wrote:

 Hi Stefan,

 if nothing went wrong the (whole) advent calendar will be on the DVD
 published with the Linux Magazin issue 03/2015. The release date is the 05.
 Feb. 2015 (in germany).

 Best regards,

 Tim Schürmann
 i...@tim-schuermann.de

 Am 15.01.2015 um 15:47 schrieb Stefan Hajnoczi:

  On Tue, Jan 13, 2015 at 01:35:55PM +0100, Tim Schürmann wrote:

 Hi Tim,

  In order to comply with the license on GPL images you will need to offer
 the source code (see links on http://qemu-advent-calendar.org/ for each
 image).


 The Ubuntu Core and Ceph sourcecodes won't fit on our DVD. So we have to
 provide links to the corresponding websites. Is this Ok with you? (My
 editor-in-chief said that this conform with the GPL.)


 I am not the (only) copyright holder of Ubuntu Core or Ceph, and I'm not
 a lawyer.  You need to decide whether you are in compliance or not.

 My plan is to keep http://qemu-advent-calendar.org/ running in the
 foreseeable future so the sources should remain accessible.  If you
 mirror the sources it would be best - that way you can continue
 providing source even if http://qemu-advent-calendar.org/ were to go
 offline.

  I'm CCing Matthew Hungerford at Pebble to confirm that the Pebble
 Smartwatch image can also be distributed.


 Did you got an answer already?


 Yes, Matthew responded.  It was a Reply instead of Reply-All.  I forgot
 to forward you his response (sorry):

 Date: Fri, 9 Jan 2015 11:12:52 -0800
 From: Matthew Hungerford matthew.hungerf...@getpebble.com

Sounds great,

One of my german coworkers is a fan of Linux Magazin, and would get a
kick out if it being featured there.

Thanks for asking,
Matthew


 All the best,
 Stefan

Re: [Qemu-devel] [PULL 00/16] Block patches

2015-01-16 Thread Peter Maydell

On 16 January 2015 at 15:36, Stefan Hajnoczi stefa...@redhat.com wrote:
 The following changes since commit df58887b20fab8fe8a6dcca4db30cd4e4077d53a:

   Merge remote-tracking branch 
 'remotes/mjt/tags/pull-trivial-patches-2015-01-15' into staging (2015-01-15 
 10:08:46 +)

 are available in the git repository at:

   git://github.com/stefanha/qemu.git tags/block-pull-request

 for you to fetch changes up to d6e2630f17fa40342af80d20c04f279fa23ea189:

   qemu-iotests: Fix supported_oses check (2015-01-16 15:36:34 +)

Hi. I'm afraid this doesn't build on all platforms, due to
dependencies on newer glib versions than our minimum
requirement.

OSX:

  CCtests/test-write-threshold.o
/Users/pm215/src/qemu/tests/test-write-threshold.c:21:5: warning:
implicit declaration of function
  'g_assert_false' is invalid in C99 [-Wimplicit-function-declaration]
g_assert_false(bdrv_write_threshold_is_set(bs));
^
1 warning generated.
  LINK  tests/test-write-threshold
Undefined symbols for architecture x86_64:
  _g_assert_false, referenced from:
  _test_threshold_not_set_on_init in test-write-threshold.o
ld: symbol(s) not found for architecture x86_64

CentOS5:

../block/dmg.o: In function `dmg_read_plist_xml':
/home/petmay01/linaro/qemu-for-merges/block/dmg.c:414: undefined
reference to `g_base64_decode_inplace'

-- PMM

Re: [Qemu-devel] [PATCH v11 03/13] qmp: Add block-dirty-bitmap-add and block-dirty-bitmap-remove

2015-01-16 Thread John Snow




On 01/16/2015 10:36 AM, Max Reitz wrote:

On 2015-01-12 at 11:30, John Snow wrote:

From: Fam Zheng f...@redhat.com

The new command pair is added to manage user created dirty bitmap. The
dirty bitmap's name is mandatory and must be unique for the same device,
but different devices can have bitmaps with the same names.

The granularity is an optional field. If it is not specified, we will
choose a default granularity based on the cluster size if available,
clamped to between 4K and 64K to mirror how the 'mirror' code was
already choosing granularity. If we do not have cluster size info
available, we choose 64K. This code has been factored out into a helper
shared with block/mirror.

This patch also introduces the 'block_dirty_bitmap_lookup' helper,
which takes a device name and a dirty bitmap name and validates the
lookup, returning NULL and setting errp if there is a problem with
either field. This helper will be re-used in future patches in this
series.

The types added to block-core.json will be re-used in future patches
in this series, see:
'qapi: Add transaction support to block-dirty-bitmap-{add, enable,
disable}'

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   |  20 ++
  block/mirror.c|  10 +
  blockdev.c| 100
++
  include/block/block.h |   1 +
  qapi/block-core.json  |  55 +++
  qmp-commands.hx   |  51 +
  6 files changed, 228 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index bfeae6b..3eb77ee 100644
--- a/block.c
+++ b/block.c
@@ -5417,6 +5417,26 @@ int bdrv_get_dirty(BlockDriverState *bs,
BdrvDirtyBitmap *bitmap, int64_t sector
  }
  }
+/**
+ * Chooses a default granularity based on the existing cluster size,
+ * but clamped between [4K, 64K]. Defaults to 64K in the case that there
+ * is no cluster size information available.
+ */
+uint64_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs)
+{
+BlockDriverInfo bdi;
+uint64_t granularity;
+
+if (bdrv_get_info(bs, bdi) = 0  bdi.cluster_size != 0) {
+granularity = MAX(4096, bdi.cluster_size);
+granularity = MIN(65536, granularity);
+} else {
+granularity = 65536;
+}
+
+return granularity;
+}
+
  void bdrv_dirty_iter_init(BlockDriverState *bs,
BdrvDirtyBitmap *bitmap, HBitmapIter *hbi)
  {
diff --git a/block/mirror.c b/block/mirror.c
index d819952..fc545f1 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -667,15 +667,7 @@ static void mirror_start_job(BlockDriverState
*bs, BlockDriverState *target,
  MirrorBlockJob *s;
  if (granularity == 0) {
-/* Choose the default granularity based on the target file's
cluster
- * size, clamped between 4k and 64k.  */
-BlockDriverInfo bdi;
-if (bdrv_get_info(target, bdi) = 0  bdi.cluster_size != 0) {
-granularity = MAX(4096, bdi.cluster_size);
-granularity = MIN(65536, granularity);
-} else {
-granularity = 65536;
-}
+granularity = bdrv_get_default_bitmap_granularity(target);
  }
  assert ((granularity  (granularity - 1)) == 0);
diff --git a/blockdev.c b/blockdev.c
index 5651a8e..95251c7 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1173,6 +1173,48 @@ out_aio_context:
  return NULL;
  }
+/**
+ * Return a dirty bitmap (if present), after validating
+ * the node reference and bitmap names. Returns NULL on error,
+ * including when the BDS and/or bitmap is not found.
+ */
+static BdrvDirtyBitmap *block_dirty_bitmap_lookup(const char *node_ref,
+  const char *name,
+  BlockDriverState
**pbs,
+  Error **errp)
+{
+BlockDriverState *bs;
+BdrvDirtyBitmap *bitmap;
+
+if (!node_ref) {
+error_setg(errp, Node reference cannot be NULL);
+return NULL;
+}
+if (!name) {
+error_setg(errp, Bitmap name cannot be NULL);
+return NULL;
+}
+
+bs = bdrv_lookup_bs(node_ref, node_ref, errp);
+if (!bs) {
+error_setg(errp, Node reference '%s' not found, node_ref);


No need to throw the (hopefully) perfectly fine Error code returned by
bdrv_lookup_bs() away.



I just wanted an error message consistent with the parameter name, in 
this case. i.e., We couldn't find the Node reference instead of 
device or node name. Just trying to distinguish the fact that this 
is an arbitrary reference in the error message.


I can still remove it, but I am curious to see what Markus thinks of the 
names I have chosen before I monkey with the errors too much more.



+return NULL;
+}
+
+/* If caller provided a BDS*, provide the result of that lookup,
too. */
+if (pbs) {
+*pbs = bs;
+}
+
+bitmap =

Re: [Qemu-devel] [PATCH v11 03/13] qmp: Add block-dirty-bitmap-add and block-dirty-bitmap-remove

2015-01-16 Thread Max Reitz


On 2015-01-16 at 11:48, John Snow wrote:



On 01/16/2015 10:36 AM, Max Reitz wrote:

On 2015-01-12 at 11:30, John Snow wrote:

From: Fam Zheng f...@redhat.com

The new command pair is added to manage user created dirty bitmap. The
dirty bitmap's name is mandatory and must be unique for the same 
device,

but different devices can have bitmaps with the same names.

The granularity is an optional field. If it is not specified, we will
choose a default granularity based on the cluster size if available,
clamped to between 4K and 64K to mirror how the 'mirror' code was
already choosing granularity. If we do not have cluster size info
available, we choose 64K. This code has been factored out into a helper
shared with block/mirror.

This patch also introduces the 'block_dirty_bitmap_lookup' helper,
which takes a device name and a dirty bitmap name and validates the
lookup, returning NULL and setting errp if there is a problem with
either field. This helper will be re-used in future patches in this
series.

The types added to block-core.json will be re-used in future patches
in this series, see:
'qapi: Add transaction support to block-dirty-bitmap-{add, enable,
disable}'

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   |  20 ++
  block/mirror.c|  10 +
  blockdev.c| 100
++
  include/block/block.h |   1 +
  qapi/block-core.json  |  55 +++
  qmp-commands.hx   |  51 +
  6 files changed, 228 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index bfeae6b..3eb77ee 100644
--- a/block.c
+++ b/block.c
@@ -5417,6 +5417,26 @@ int bdrv_get_dirty(BlockDriverState *bs,
BdrvDirtyBitmap *bitmap, int64_t sector
  }
  }
+/**
+ * Chooses a default granularity based on the existing cluster size,
+ * but clamped between [4K, 64K]. Defaults to 64K in the case that 
there

+ * is no cluster size information available.
+ */
+uint64_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs)
+{
+BlockDriverInfo bdi;
+uint64_t granularity;
+
+if (bdrv_get_info(bs, bdi) = 0  bdi.cluster_size != 0) {
+granularity = MAX(4096, bdi.cluster_size);
+granularity = MIN(65536, granularity);
+} else {
+granularity = 65536;
+}
+
+return granularity;
+}
+
  void bdrv_dirty_iter_init(BlockDriverState *bs,
BdrvDirtyBitmap *bitmap, HBitmapIter *hbi)
  {
diff --git a/block/mirror.c b/block/mirror.c
index d819952..fc545f1 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -667,15 +667,7 @@ static void mirror_start_job(BlockDriverState
*bs, BlockDriverState *target,
  MirrorBlockJob *s;
  if (granularity == 0) {
-/* Choose the default granularity based on the target file's
cluster
- * size, clamped between 4k and 64k.  */
-BlockDriverInfo bdi;
-if (bdrv_get_info(target, bdi) = 0  bdi.cluster_size != 
0) {

-granularity = MAX(4096, bdi.cluster_size);
-granularity = MIN(65536, granularity);
-} else {
-granularity = 65536;
-}
+granularity = bdrv_get_default_bitmap_granularity(target);
  }
  assert ((granularity  (granularity - 1)) == 0);
diff --git a/blockdev.c b/blockdev.c
index 5651a8e..95251c7 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1173,6 +1173,48 @@ out_aio_context:
  return NULL;
  }
+/**
+ * Return a dirty bitmap (if present), after validating
+ * the node reference and bitmap names. Returns NULL on error,
+ * including when the BDS and/or bitmap is not found.
+ */
+static BdrvDirtyBitmap *block_dirty_bitmap_lookup(const char 
*node_ref,

+  const char *name,
+ BlockDriverState
**pbs,
+  Error **errp)
+{
+BlockDriverState *bs;
+BdrvDirtyBitmap *bitmap;
+
+if (!node_ref) {
+error_setg(errp, Node reference cannot be NULL);
+return NULL;
+}
+if (!name) {
+error_setg(errp, Bitmap name cannot be NULL);
+return NULL;
+}
+
+bs = bdrv_lookup_bs(node_ref, node_ref, errp);
+if (!bs) {
+error_setg(errp, Node reference '%s' not found, node_ref);


No need to throw the (hopefully) perfectly fine Error code returned by
bdrv_lookup_bs() away.



I just wanted an error message consistent with the parameter name, in 
this case. i.e., We couldn't find the Node reference instead of 
device or node name. Just trying to distinguish the fact that this 
is an arbitrary reference in the error message.


I can still remove it, but I am curious to see what Markus thinks of 
the names I have chosen before I monkey with the errors too much more.


Very well then, but you should clean up the error returned by 
bdrv_lookup_bs() (call error_free()).


Feel free to keep my R-b whichever decision you'll make (as long as

Re: [Qemu-devel] [PATCH v11 03/13] qmp: Add block-dirty-bitmap-add and block-dirty-bitmap-remove

2015-01-16 Thread John Snow




On 01/16/2015 11:51 AM, Max Reitz wrote:

On 2015-01-16 at 11:48, John Snow wrote:



On 01/16/2015 10:36 AM, Max Reitz wrote:

On 2015-01-12 at 11:30, John Snow wrote:

From: Fam Zheng f...@redhat.com

The new command pair is added to manage user created dirty bitmap. The
dirty bitmap's name is mandatory and must be unique for the same
device,
but different devices can have bitmaps with the same names.

The granularity is an optional field. If it is not specified, we will
choose a default granularity based on the cluster size if available,
clamped to between 4K and 64K to mirror how the 'mirror' code was
already choosing granularity. If we do not have cluster size info
available, we choose 64K. This code has been factored out into a helper
shared with block/mirror.

This patch also introduces the 'block_dirty_bitmap_lookup' helper,
which takes a device name and a dirty bitmap name and validates the
lookup, returning NULL and setting errp if there is a problem with
either field. This helper will be re-used in future patches in this
series.

The types added to block-core.json will be re-used in future patches
in this series, see:
'qapi: Add transaction support to block-dirty-bitmap-{add, enable,
disable}'

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   |  20 ++
  block/mirror.c|  10 +
  blockdev.c| 100
++
  include/block/block.h |   1 +
  qapi/block-core.json  |  55 +++
  qmp-commands.hx   |  51 +
  6 files changed, 228 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index bfeae6b..3eb77ee 100644
--- a/block.c
+++ b/block.c
@@ -5417,6 +5417,26 @@ int bdrv_get_dirty(BlockDriverState *bs,
BdrvDirtyBitmap *bitmap, int64_t sector
  }
  }
+/**
+ * Chooses a default granularity based on the existing cluster size,
+ * but clamped between [4K, 64K]. Defaults to 64K in the case that
there
+ * is no cluster size information available.
+ */
+uint64_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs)
+{
+BlockDriverInfo bdi;
+uint64_t granularity;
+
+if (bdrv_get_info(bs, bdi) = 0  bdi.cluster_size != 0) {
+granularity = MAX(4096, bdi.cluster_size);
+granularity = MIN(65536, granularity);
+} else {
+granularity = 65536;
+}
+
+return granularity;
+}
+
  void bdrv_dirty_iter_init(BlockDriverState *bs,
BdrvDirtyBitmap *bitmap, HBitmapIter *hbi)
  {
diff --git a/block/mirror.c b/block/mirror.c
index d819952..fc545f1 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -667,15 +667,7 @@ static void mirror_start_job(BlockDriverState
*bs, BlockDriverState *target,
  MirrorBlockJob *s;
  if (granularity == 0) {
-/* Choose the default granularity based on the target file's
cluster
- * size, clamped between 4k and 64k.  */
-BlockDriverInfo bdi;
-if (bdrv_get_info(target, bdi) = 0  bdi.cluster_size !=
0) {
-granularity = MAX(4096, bdi.cluster_size);
-granularity = MIN(65536, granularity);
-} else {
-granularity = 65536;
-}
+granularity = bdrv_get_default_bitmap_granularity(target);
  }
  assert ((granularity  (granularity - 1)) == 0);
diff --git a/blockdev.c b/blockdev.c
index 5651a8e..95251c7 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1173,6 +1173,48 @@ out_aio_context:
  return NULL;
  }
+/**
+ * Return a dirty bitmap (if present), after validating
+ * the node reference and bitmap names. Returns NULL on error,
+ * including when the BDS and/or bitmap is not found.
+ */
+static BdrvDirtyBitmap *block_dirty_bitmap_lookup(const char
*node_ref,
+  const char *name,
+ BlockDriverState
**pbs,
+  Error **errp)
+{
+BlockDriverState *bs;
+BdrvDirtyBitmap *bitmap;
+
+if (!node_ref) {
+error_setg(errp, Node reference cannot be NULL);
+return NULL;
+}
+if (!name) {
+error_setg(errp, Bitmap name cannot be NULL);
+return NULL;
+}
+
+bs = bdrv_lookup_bs(node_ref, node_ref, errp);
+if (!bs) {
+error_setg(errp, Node reference '%s' not found, node_ref);


No need to throw the (hopefully) perfectly fine Error code returned by
bdrv_lookup_bs() away.



I just wanted an error message consistent with the parameter name, in
this case. i.e., We couldn't find the Node reference instead of
device or node name. Just trying to distinguish the fact that this
is an arbitrary reference in the error message.

I can still remove it, but I am curious to see what Markus thinks of
the names I have chosen before I monkey with the errors too much more.


Very well then, but you should clean up the error returned by
bdrv_lookup_bs() (call error_free()).

Feel free to keep my R-b whichever

Re: [Qemu-devel] [PATCH v11 07/13] qmp: Add block-dirty-bitmap-enable and block-dirty-bitmap-disable

2015-01-16 Thread John Snow




On 01/16/2015 11:28 AM, Max Reitz wrote:

On 2015-01-12 at 11:30, John Snow wrote:

From: Fam Zheng f...@redhat.com

This allows to put the dirty bitmap into a disabled state where no more
writes will be tracked.

It will be used before backup or writing to persistent file.

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   | 24 +++-
  blockdev.c| 40 
  include/block/block.h |  3 +++
  qapi/block-core.json  | 28 
  qmp-commands.hx   | 10 ++
  5 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 7bf7079..bd4b449 100644
--- a/block.c
+++ b/block.c
@@ -56,6 +56,7 @@ struct BdrvDirtyBitmap {
  int64_t size;
  int64_t granularity;
  char *name;
+bool enabled;
  QLIST_ENTRY(BdrvDirtyBitmap) list;
  };
@@ -5373,10 +5374,16 @@ BdrvDirtyBitmap
*bdrv_create_dirty_bitmap(BlockDriverState *bs,
  bitmap-granularity = granularity;
  bitmap-bitmap = hbitmap_alloc(bitmap-size,
ffs(sector_granularity) - 1);
  bitmap-name = g_strdup(name);
+bitmap-enabled = true;
  QLIST_INSERT_HEAD(bs-dirty_bitmaps, bitmap, list);
  return bitmap;
  }
+bool bdrv_dirty_bitmap_enabled(BdrvDirtyBitmap *bitmap)
+{
+return bitmap-enabled;
+}
+
  void bdrv_release_dirty_bitmap(BlockDriverState *bs, BdrvDirtyBitmap
*bitmap)
  {
  BdrvDirtyBitmap *bm, *next;
@@ -5391,6 +5398,16 @@ void bdrv_release_dirty_bitmap(BlockDriverState
*bs, BdrvDirtyBitmap *bitmap)
  }
  }
+void bdrv_disable_dirty_bitmap(BdrvDirtyBitmap *bitmap)
+{
+bitmap-enabled = false;
+}
+
+void bdrv_enable_dirty_bitmap(BdrvDirtyBitmap *bitmap)
+{
+bitmap-enabled = true;
+}
+
  BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs)
  {
  BdrvDirtyBitmap *bm;
@@ -5458,7 +5475,9 @@ void bdrv_dirty_iter_init(BlockDriverState *bs,
  void bdrv_set_dirty_bitmap(BlockDriverState *bs, BdrvDirtyBitmap
*bitmap,
 int64_t cur_sector, int nr_sectors)
  {
-hbitmap_set(bitmap-bitmap, cur_sector, nr_sectors);
+if (bdrv_dirty_bitmap_enabled(bitmap)) {
+hbitmap_set(bitmap-bitmap, cur_sector, nr_sectors);
+}


Why are you checking whether the bitmap is enabled here in
bdrv_set_dirty_bitmap(), but neither in bdrv_reset_dirty_bitmap() nor
bdrv_clear_dirty_bitmap()?

It seems consistent with the commit message which only states that
disabled state means that no more writes (i.e. bdrv_set_dirty_bitmap())
will be tracked, but it still seems strange to me.


  }
  void bdrv_reset_dirty_bitmap(BlockDriverState *bs, BdrvDirtyBitmap
*bitmap,
@@ -5481,6 +5500,9 @@ static void bdrv_set_dirty(BlockDriverState *bs,
int64_t cur_sector,
  {
  BdrvDirtyBitmap *bitmap;
  QLIST_FOREACH(bitmap, bs-dirty_bitmaps, list) {
+if (!bdrv_dirty_bitmap_enabled(bitmap)) {
+continue;
+}


Same question as above, only this time it's about bdrv_reset_dirty().

In case the answer to the question is that's intentional:

Reviewed-by: Max Reitz mre...@redhat.com



Just because it's intentional doesn't make it a good idea.

Fam is currently suggesting we might just scrap the enable/disable and 
frozen/normal modes altogether in favor of a single unified status, 
but the original way he proposed enabled/disabled is that it simply, and 
exclusively enabled/disabled new writes to the bitmap -- it never was 
intended to block clears/resets.


So the bitmap is still able to be consumed while disabled, for instance.

How useful is this? Perhaps not very useful. Probably more confusing 
than useful.


I will *probably* try to unify these parameters as Fam suggested, but I 
am really eager to hear from Markus on the QMP interface side before I 
bother sending out another minor iteration.


I'll skip the R-b for now, since Fam had some ideas on how to make it nicer.

Thanks!
--js

[Qemu-devel] [RFC 03/10] replace spinlock by QemuMutex.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

spinlock is only used in two cases:
  * cpu-exec.c: to protect TranslationBlock
  * mem_helper.c: for lock helper in target-i386 (which seems broken).

It's a pthread_mutex_t in user-mode so better using QemuMutex directly in this
case.
It allows as well to reuse tb_lock mutex of TBContext in case of multithread
TCG.

Signed-off-by: KONRAD Frederic fred.kon...@greensocs.com
---
 cpu-exec.c   | 15 +++
 include/exec/exec-all.h  |  4 ++--
 linux-user/main.c|  6 +++---
 target-i386/mem_helper.c | 16 +---
 tcg/i386/tcg-target.c|  8 
 5 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index a4f0eff..1e7513c 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -335,7 +335,9 @@ int cpu_exec(CPUArchState *env)
 SyncClocks sc;
 
 /* This must be volatile so it is not trashed by longjmp() */
+#if defined(CONFIG_USER_ONLY)
 volatile bool have_tb_lock = false;
+#endif
 
 if (cpu-halted) {
 if (!cpu_has_work(cpu)) {
@@ -451,8 +453,10 @@ int cpu_exec(CPUArchState *env)
 cpu-exception_index = EXCP_INTERRUPT;
 cpu_loop_exit(cpu);
 }
-spin_lock(tcg_ctx.tb_ctx.tb_lock);
+#if defined(CONFIG_USER_ONLY)
+qemu_mutex_lock(tcg_ctx.tb_ctx.tb_lock);
 have_tb_lock = true;
+#endif
 tb = tb_find_fast(env);
 /* Note: we do it here to avoid a gcc bug on Mac OS X when
doing it in tb_find_slow */
@@ -474,9 +478,10 @@ int cpu_exec(CPUArchState *env)
 tb_add_jump((TranslationBlock *)(next_tb  ~TB_EXIT_MASK),
 next_tb  TB_EXIT_MASK, tb);
 }
+#if defined(CONFIG_USER_ONLY)
 have_tb_lock = false;
-spin_unlock(tcg_ctx.tb_ctx.tb_lock);
-
+qemu_mutex_unlock(tcg_ctx.tb_ctx.tb_lock);
+#endif
 /* cpu_interrupt might be called while translating the
TB, but before it is linked into a potentially
infinite loop and becomes env-current_tb. Avoid
@@ -549,10 +554,12 @@ int cpu_exec(CPUArchState *env)
 #ifdef TARGET_I386
 x86_cpu = X86_CPU(cpu);
 #endif
+#if defined(CONFIG_USER_ONLY)
 if (have_tb_lock) {
-spin_unlock(tcg_ctx.tb_ctx.tb_lock);
+qemu_mutex_unlock(tcg_ctx.tb_ctx.tb_lock);
 have_tb_lock = false;
 }
+#endif
 }
 } /* for(;;) */
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 6a15448..6a450c1 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -173,7 +173,7 @@ struct TranslationBlock {
 struct TranslationBlock *jmp_first;
 };
 
-#include exec/spinlock.h
+#include qemu/thread.h
 
 typedef struct TBContext TBContext;
 
@@ -183,7 +183,7 @@ struct TBContext {
 TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
 int nb_tbs;
 /* any access to the tbs or the page table must use this lock */
-spinlock_t tb_lock;
+QemuMutex tb_lock;
 
 /* statistics */
 int tb_flush_count;
diff --git a/linux-user/main.c b/linux-user/main.c
index 67b0231..97e7d50 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -107,7 +107,7 @@ static int pending_cpus;
 /* Make sure everything is in a consistent state for calling fork().  */
 void fork_start(void)
 {
-pthread_mutex_lock(tcg_ctx.tb_ctx.tb_lock);
+qemu_mutex_lock(tcg_ctx.tb_ctx.tb_lock);
 pthread_mutex_lock(exclusive_lock);
 mmap_fork_start();
 }
@@ -129,11 +129,11 @@ void fork_end(int child)
 pthread_mutex_init(cpu_list_mutex, NULL);
 pthread_cond_init(exclusive_cond, NULL);
 pthread_cond_init(exclusive_resume, NULL);
-pthread_mutex_init(tcg_ctx.tb_ctx.tb_lock, NULL);
+qemu_mutex_init(tcg_ctx.tb_ctx.tb_lock);
 gdbserver_fork((CPUArchState *)thread_cpu-env_ptr);
 } else {
 pthread_mutex_unlock(exclusive_lock);
-pthread_mutex_unlock(tcg_ctx.tb_ctx.tb_lock);
+qemu_mutex_unlock(tcg_ctx.tb_ctx.tb_lock);
 }
 }
 
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index 1aec8a5..7106cc3 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -23,17 +23,27 @@
 
 /* broken thread support */
 
-static spinlock_t global_cpu_lock = SPIN_LOCK_UNLOCKED;
+#if defined(CONFIG_USER_ONLY)
+QemuMutex global_cpu_lock;
 
 void helper_lock(void)
 {
-spin_lock(global_cpu_lock);
+qemu_mutex_lock(global_cpu_lock);
 }
 
 void helper_unlock(void)
 {
-spin_unlock(global_cpu_lock);
+qemu_mutex_unlock(global_cpu_lock);
 }
+#else
+void helper_lock(void)
+{
+}
+
+void helper_unlock(void)
+{
+}
+#endif
 
 void helper_cmpxchg8b(CPUX86State *env, target_ulong a0)
 {
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 4133dcf..cbb8a4f 100644
---

[Qemu-devel] [RFC 01/10] target-arm: protect cpu_exclusive_*.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

This adds a lock to avoid multiple exclusive access at the same time in case of
TCG multithread.

Signed-off-by: KONRAD Frederic fred.kon...@greensocs.com

V1 - V2:
  Removed qemu_mutex_destroy().
---
 target-arm/cpu.c   | 14 ++
 target-arm/cpu.h   |  3 +++
 target-arm/helper.h|  3 +++
 target-arm/op_helper.c | 10 ++
 target-arm/translate.c |  6 ++
 5 files changed, 36 insertions(+)

diff --git a/target-arm/cpu.c b/target-arm/cpu.c
index 285947f..75bdc5b 100644
--- a/target-arm/cpu.c
+++ b/target-arm/cpu.c
@@ -31,6 +31,19 @@
 #include sysemu/kvm.h
 #include kvm_arm.h
 
+/* Protect cpu_exclusive_* variable .*/
+QemuMutex cpu_exclusive_lock;
+
+inline void arm_exclusive_lock(void)
+{
+qemu_mutex_lock(cpu_exclusive_lock);
+}
+
+inline void arm_exclusive_unlock(void)
+{
+qemu_mutex_unlock(cpu_exclusive_lock);
+}
+
 static void arm_cpu_set_pc(CPUState *cs, vaddr value)
 {
 ARMCPU *cpu = ARM_CPU(cs);
@@ -374,6 +387,7 @@ static void arm_cpu_initfn(Object *obj)
 cpu-psci_version = 2; /* TCG implements PSCI 0.2 */
 if (!inited) {
 inited = true;
+qemu_mutex_init(cpu_exclusive_lock);
 arm_translate_init();
 }
 }
diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index 7ba55f0..2101d85 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -1821,4 +1821,7 @@ enum {
 QEMU_PSCI_CONDUIT_HVC = 2,
 };
 
+void arm_exclusive_lock(void);
+void arm_exclusive_unlock(void);
+
 #endif
diff --git a/target-arm/helper.h b/target-arm/helper.h
index dec3728..ce07711 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -529,6 +529,9 @@ DEF_HELPER_2(dc_zva, void, env, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 
+DEF_HELPER_0(exclusive_lock, void)
+DEF_HELPER_0(exclusive_unlock, void)
+
 #ifdef TARGET_AARCH64
 #include helper-a64.h
 #endif
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index 2bed914..d830fd8 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -33,6 +33,16 @@ static void raise_exception(CPUARMState *env, int tt)
 cpu_loop_exit(cs);
 }
 
+void HELPER(exclusive_lock)(void)
+{
+arm_exclusive_lock();
+}
+
+void HELPER(exclusive_unlock)(void)
+{
+arm_exclusive_unlock();
+}
+
 uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
   uint32_t rn, uint32_t maxindex)
 {
diff --git a/target-arm/translate.c b/target-arm/translate.c
index bdfcdf1..513d151 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -7381,6 +7381,7 @@ static void gen_load_exclusive(DisasContext *s, int rt, 
int rt2,
 abort();
 }
 
+gen_helper_exclusive_lock();
 if (size == 3) {
 TCGv_i32 tmp2 = tcg_temp_new_i32();
 TCGv_i32 tmp3 = tcg_temp_new_i32();
@@ -7396,11 +7397,14 @@ static void gen_load_exclusive(DisasContext *s, int rt, 
int rt2,
 
 store_reg(s, rt, tmp);
 tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
+gen_helper_exclusive_unlock();
 }
 
 static void gen_clrex(DisasContext *s)
 {
+gen_helper_exclusive_lock();
 tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+gen_helper_exclusive_unlock();
 }
 
 #ifdef CONFIG_USER_ONLY
@@ -7431,6 +7435,7 @@ static void gen_store_exclusive(DisasContext *s, int rd, 
int rt, int rt2,
 done_label = gen_new_label();
 extaddr = tcg_temp_new_i64();
 tcg_gen_extu_i32_i64(extaddr, addr);
+gen_helper_exclusive_lock();
 tcg_gen_brcond_i64(TCG_COND_NE, extaddr, cpu_exclusive_addr, fail_label);
 tcg_temp_free_i64(extaddr);
 
@@ -7495,6 +7500,7 @@ static void gen_store_exclusive(DisasContext *s, int rd, 
int rt, int rt2,
 tcg_gen_movi_i32(cpu_R[rd], 1);
 gen_set_label(done_label);
 tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+gen_helper_exclusive_unlock();
 }
 #endif
 
-- 
1.9.0

[Qemu-devel] [RFC 08/10] Drop global lock during TCG code execution

2015-01-16 Thread fred . konrad

From: Jan Kiszka jan.kis...@siemens.com

This finally allows TCG to benefit from the iothread introduction: Drop
the global mutex while running pure TCG CPU code. Reacquire the lock
when entering MMIO or PIO emulation, or when leaving the TCG loop.

We have to revert a few optimization for the current TCG threading
model, namely kicking the TCG thread in qemu_mutex_lock_iothread and not
kicking it in qemu_cpu_kick. We also need to disable RAM block
reordering until we have a more efficient locking mechanism at hand.

I'm pretty sure some cases are still broken, definitely SMP (we no
longer perform round-robin scheduling by chance). Still, a Linux x86
UP guest and my Musicpal ARM model boot fine here. These numbers
demonstrate where we gain something:

20338 jan   20   0  331m  75m 6904 R   99  0.9   0:50.95 qemu-system-arm
20337 jan   20   0  331m  75m 6904 S   20  0.9   0:26.50 qemu-system-arm

The guest CPU was fully loaded, but the iothread could still run mostly
independent on a second core. Without the patch we don't get beyond

32206 jan   20   0  330m  73m 7036 R   82  0.9   1:06.00 qemu-system-arm
32204 jan   20   0  330m  73m 7036 S   21  0.9   0:17.03 qemu-system-arm

We don't benefit significantly, though, when the guest is not fully
loading a host CPU.

Note that this patch depends on
http://thread.gmane.org/gmane.comp.emulators.qemu/118657

Changes from Fred Konrad:
  * Rebase on the current HEAD.
  * Fixes a deadlock in qemu_devices_reset().
---
 cpus.c| 19 +++
 cputlb.c  |  5 +
 exec.c| 25 +
 softmmu_template.h|  6 ++
 target-i386/misc_helper.c | 27 ---
 translate-all.c   |  2 ++
 vl.c  |  6 ++
 7 files changed, 75 insertions(+), 15 deletions(-)

diff --git a/cpus.c b/cpus.c
index 91a48f2..f10c94d 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1017,7 +1017,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 qemu_tcg_init_cpu_signals();
 qemu_thread_get_self(cpu-thread);
 
-qemu_mutex_lock(qemu_global_mutex);
+qemu_mutex_lock_iothread();
 CPU_FOREACH(cpu) {
 cpu-thread_id = qemu_get_thread_id();
 cpu-created = true;
@@ -1125,17 +1125,7 @@ static bool qemu_in_vcpu_thread(void)
 
 void qemu_mutex_lock_iothread(void)
 {
-if (!tcg_enabled()) {
-qemu_mutex_lock(qemu_global_mutex);
-} else {
-iothread_requesting_mutex = true;
-if (qemu_mutex_trylock(qemu_global_mutex)) {
-qemu_cpu_kick_thread(first_cpu);
-qemu_mutex_lock(qemu_global_mutex);
-}
-iothread_requesting_mutex = false;
-qemu_cond_broadcast(qemu_io_proceeded_cond);
-}
+qemu_mutex_lock(qemu_global_mutex);
 }
 
 void qemu_mutex_unlock_iothread(void)
@@ -1356,7 +1346,12 @@ static int tcg_cpu_exec(CPUArchState *env)
 cpu-icount_decr.u16.low = decr;
 cpu-icount_extra = count;
 }
+
+qemu_mutex_unlock_iothread();
+
 ret = cpu_exec(env);
+
+qemu_mutex_lock_iothread();
 #ifdef CONFIG_PROFILER
 qemu_time += profile_getclock() - ti;
 #endif
diff --git a/cputlb.c b/cputlb.c
index 3b271d4..4a7e634 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -30,6 +30,9 @@
 #include exec/ram_addr.h
 #include tcg/tcg.h
 
+void qemu_mutex_lock_iothread(void);
+void qemu_mutex_unlock_iothread(void);
+
 //#define DEBUG_TLB
 //#define DEBUG_TLB_CHECK
 
@@ -125,8 +128,10 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
can be detected */
 void tlb_protect_code(ram_addr_t ram_addr)
 {
+qemu_mutex_lock_iothread();
 cpu_physical_memory_reset_dirty(ram_addr, TARGET_PAGE_SIZE,
 DIRTY_MEMORY_CODE);
+qemu_mutex_unlock_iothread();
 }
 
 /* update the TLB so that writes in physical page 'phys_addr' are no longer
diff --git a/exec.c b/exec.c
index 081818e..705d451 100644
--- a/exec.c
+++ b/exec.c
@@ -1786,6 +1786,7 @@ static void check_watchpoint(int offset, int len, int 
flags)
 }
 wp-hitaddr = vaddr;
 if (!cpu-watchpoint_hit) {
+qemu_mutex_unlock_iothread();
 cpu-watchpoint_hit = wp;
 tb_check_watchpoint(cpu);
 if (wp-flags  BP_STOP_BEFORE_ACCESS) {
@@ -2557,6 +2558,7 @@ static inline uint32_t ldl_phys_internal(AddressSpace 
*as, hwaddr addr,
 mr = address_space_translate(as, addr, addr1, l, false);
 if (l  4 || !memory_access_is_direct(mr, false)) {
 /* I/O case */
+qemu_mutex_lock_iothread();
 io_mem_read(mr, addr1, val, 4);
 #if defined(TARGET_WORDS_BIGENDIAN)
 if (endian == DEVICE_LITTLE_ENDIAN) {
@@ -2567,6 +2569,7 @@ static inline uint32_t ldl_phys_internal(AddressSpace 
*as, hwaddr addr,
 val = bswap32(val);
 }
 #endif
+qemu_mutex_unlock_iothread();
 } else {
 /* RAM case */
 ptr =

[Qemu-devel] [RFC 04/10] remove unused spinlock.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

This just removes spinlock as it is not used anymore.

Signed-off-by: KONRAD Frederic fred.kon...@greensocs.com
---
 include/exec/spinlock.h | 49 -
 scripts/checkpatch.pl   |  9 ++---
 2 files changed, 2 insertions(+), 56 deletions(-)
 delete mode 100644 include/exec/spinlock.h

diff --git a/include/exec/spinlock.h b/include/exec/spinlock.h
deleted file mode 100644
index a72edda..000
--- a/include/exec/spinlock.h
+++ /dev/null
@@ -1,49 +0,0 @@
-/*
- *  Copyright (c) 2003 Fabrice Bellard
- *
- * This library is free software; you can redistribute it and/or
- * modify it under the terms of the GNU Lesser General Public
- * License as published by the Free Software Foundation; either
- * version 2 of the License, or (at your option) any later version.
- *
- * This library is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * Lesser General Public License for more details.
- *
- * You should have received a copy of the GNU Lesser General Public
- * License along with this library; if not, see http://www.gnu.org/licenses/
- */
-
-/* configure guarantees us that we have pthreads on any host except
- * mingw32, which doesn't support any of the user-only targets.
- * So we can simply assume we have pthread mutexes here.
- */
-#if defined(CONFIG_USER_ONLY)
-
-#include pthread.h
-#define spin_lock pthread_mutex_lock
-#define spin_unlock pthread_mutex_unlock
-#define spinlock_t pthread_mutex_t
-#define SPIN_LOCK_UNLOCKED PTHREAD_MUTEX_INITIALIZER
-
-#else
-
-/* Empty implementations, on the theory that system mode emulation
- * is single-threaded. This means that these functions should only
- * be used from code run in the TCG cpu thread, and cannot protect
- * data structures which might also be accessed from the IO thread
- * or from signal handlers.
- */
-typedef int spinlock_t;
-#define SPIN_LOCK_UNLOCKED 0
-
-static inline void spin_lock(spinlock_t *lock)
-{
-}
-
-static inline void spin_unlock(spinlock_t *lock)
-{
-}
-
-#endif
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 053e432..e97b707 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2655,11 +2655,6 @@ sub process {
WARN(Use of volatile is usually wrong: see 
Documentation/volatile-considered-harmful.txt\n . $herecurr);
}
 
-# SPIN_LOCK_UNLOCKED  RW_LOCK_UNLOCKED are deprecated
-   if ($line =~ /\b(SPIN_LOCK_UNLOCKED|RW_LOCK_UNLOCKED)/) {
-   ERROR(Use of $1 is deprecated: see 
Documentation/spinlocks.txt\n . $herecurr);
-   }
-
 # warn about #if 0
if ($line =~ /^.\s*\#\s*if\s+0\b/) {
CHK(if this code is redundant consider removing it\n .
@@ -2708,8 +2703,8 @@ sub process {
ERROR(exactly one space required after that #$1\n . 
$herecurr);
}
 
-# check for spinlock_t definitions without a comment.
-   if ($line =~ /^.\s*(struct\s+mutex|spinlock_t)\s+\S+;/ ||
+# check for mutex definitions without a comment.
+   if ($line =~ /^.\s*(struct\s+mutex)\s+\S+;/ ||
$line =~ /^.\s*(DEFINE_MUTEX)\s*\(/) {
my $which = $1;
if (!ctx_has_comment($first_line, $linenr)) {
-- 
1.9.0

[Qemu-devel] [RFC 07/10] tcg: remove tcg_halt_cond global variable.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

This removes tcg_halt_cond global variable.
We need one QemuCond per virtual cpu for multithread TCG.

Signed-off-by: KONRAD Frederic fred.kon...@greensocs.com
---
 cpus.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/cpus.c b/cpus.c
index 2edb5cd..91a48f2 100644
--- a/cpus.c
+++ b/cpus.c
@@ -796,7 +796,6 @@ static bool iothread_requesting_mutex;
 static QemuThread io_thread;
 
 static QemuThread *tcg_cpu_thread;
-static QemuCond *tcg_halt_cond;
 
 /* cpu creation */
 static QemuCond qemu_cpu_cond;
@@ -902,15 +901,13 @@ static void qemu_wait_io_event_common(CPUState *cpu)
 cpu-thread_kicked = false;
 }
 
-static void qemu_tcg_wait_io_event(void)
+static void qemu_tcg_wait_io_event(CPUState *cpu)
 {
-CPUState *cpu;
-
 while (all_cpu_threads_idle()) {
/* Start accounting real time to the virtual clock if the CPUs
   are idle.  */
 qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
-qemu_cond_wait(tcg_halt_cond, qemu_global_mutex);
+qemu_cond_wait(cpu-halt_cond, qemu_global_mutex);
 }
 
 while (iothread_requesting_mutex) {
@@ -1030,7 +1027,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 
 /* wait for initial kick-off after machine start */
 while (QTAILQ_FIRST(cpus)-stopped) {
-qemu_cond_wait(tcg_halt_cond, qemu_global_mutex);
+qemu_cond_wait(QTAILQ_FIRST(cpus)-halt_cond, qemu_global_mutex);
 
 /* process any pending work */
 CPU_FOREACH(cpu) {
@@ -1048,7 +1045,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
 }
 }
-qemu_tcg_wait_io_event();
+qemu_tcg_wait_io_event(QTAILQ_FIRST(cpus));
 }
 
 return NULL;
@@ -1214,12 +1211,12 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
 
 tcg_cpu_address_space_init(cpu, cpu-as);
 
+cpu-halt_cond = g_malloc0(sizeof(QemuCond));
+qemu_cond_init(cpu-halt_cond);
+
 /* share a single thread for all cpus with TCG */
 if (!tcg_cpu_thread) {
 cpu-thread = g_malloc0(sizeof(QemuThread));
-cpu-halt_cond = g_malloc0(sizeof(QemuCond));
-qemu_cond_init(cpu-halt_cond);
-tcg_halt_cond = cpu-halt_cond;
 snprintf(thread_name, VCPU_THREAD_NAME_SIZE, CPU %d/TCG,
  cpu-cpu_index);
 qemu_thread_create(cpu-thread, thread_name, qemu_tcg_cpu_thread_fn,
@@ -1233,7 +1230,6 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
 tcg_cpu_thread = cpu-thread;
 } else {
 cpu-thread = tcg_cpu_thread;
-cpu-halt_cond = tcg_halt_cond;
 }
 }
 
-- 
1.9.0

[Qemu-devel] [RFC 05/10] extract TBContext from TCGContext.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

In order to have one TCGContext per thread and a single TBContext we have to
extract TBContext from TCGContext.

Signed-off-by: KONRAD Frederic fred.kon...@greensocs.com
---
 cpu-exec.c| 18 ++---
 linux-user/main.c |  6 ++---
 tcg/tcg.h |  3 +--
 translate-all.c   | 79 +++
 4 files changed, 52 insertions(+), 54 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 1e7513c..4d22252 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -237,13 +237,13 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
 tb_page_addr_t phys_pc, phys_page1;
 target_ulong virt_page2;
 
-tcg_ctx.tb_ctx.tb_invalidated_flag = 0;
+tb_ctx.tb_invalidated_flag = 0;
 
 /* find translated block using physical mappings */
 phys_pc = get_page_addr_code(env, pc);
 phys_page1 = phys_pc  TARGET_PAGE_MASK;
 h = tb_phys_hash_func(phys_pc);
-ptb1 = tcg_ctx.tb_ctx.tb_phys_hash[h];
+ptb1 = tb_ctx.tb_phys_hash[h];
 for(;;) {
 tb = *ptb1;
 if (!tb)
@@ -275,8 +275,8 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
 /* Move the last found TB to the head of the list */
 if (likely(*ptb1)) {
 *ptb1 = tb-phys_hash_next;
-tb-phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
-tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
+tb-phys_hash_next = tb_ctx.tb_phys_hash[h];
+tb_ctx.tb_phys_hash[h] = tb;
 }
 /* we add the TB in the virtual pc hash table */
 cpu-tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
@@ -454,18 +454,18 @@ int cpu_exec(CPUArchState *env)
 cpu_loop_exit(cpu);
 }
 #if defined(CONFIG_USER_ONLY)
-qemu_mutex_lock(tcg_ctx.tb_ctx.tb_lock);
+qemu_mutex_lock(tb_ctx.tb_lock);
 have_tb_lock = true;
 #endif
 tb = tb_find_fast(env);
 /* Note: we do it here to avoid a gcc bug on Mac OS X when
doing it in tb_find_slow */
-if (tcg_ctx.tb_ctx.tb_invalidated_flag) {
+if (tb_ctx.tb_invalidated_flag) {
 /* as some TB could have been invalidated because
of memory exceptions while generating the code, we
must recompute the hash index here */
 next_tb = 0;
-tcg_ctx.tb_ctx.tb_invalidated_flag = 0;
+tb_ctx.tb_invalidated_flag = 0;
 }
 if (qemu_loglevel_mask(CPU_LOG_EXEC)) {
 qemu_log(Trace %p [ TARGET_FMT_lx ] %s\n,
@@ -480,7 +480,7 @@ int cpu_exec(CPUArchState *env)
 }
 #if defined(CONFIG_USER_ONLY)
 have_tb_lock = false;
-qemu_mutex_unlock(tcg_ctx.tb_ctx.tb_lock);
+qemu_mutex_unlock(tb_ctx.tb_lock);
 #endif
 /* cpu_interrupt might be called while translating the
TB, but before it is linked into a potentially
@@ -556,7 +556,7 @@ int cpu_exec(CPUArchState *env)
 #endif
 #if defined(CONFIG_USER_ONLY)
 if (have_tb_lock) {
-qemu_mutex_unlock(tcg_ctx.tb_ctx.tb_lock);
+qemu_mutex_unlock(tb_ctx.tb_lock);
 have_tb_lock = false;
 }
 #endif
diff --git a/linux-user/main.c b/linux-user/main.c
index 97e7d50..2a4c948 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -107,7 +107,7 @@ static int pending_cpus;
 /* Make sure everything is in a consistent state for calling fork().  */
 void fork_start(void)
 {
-qemu_mutex_lock(tcg_ctx.tb_ctx.tb_lock);
+qemu_mutex_lock(tb_ctx.tb_lock);
 pthread_mutex_lock(exclusive_lock);
 mmap_fork_start();
 }
@@ -129,11 +129,11 @@ void fork_end(int child)
 pthread_mutex_init(cpu_list_mutex, NULL);
 pthread_cond_init(exclusive_cond, NULL);
 pthread_cond_init(exclusive_resume, NULL);
-qemu_mutex_init(tcg_ctx.tb_ctx.tb_lock);
+qemu_mutex_init(tb_ctx.tb_lock);
 gdbserver_fork((CPUArchState *)thread_cpu-env_ptr);
 } else {
 pthread_mutex_unlock(exclusive_lock);
-qemu_mutex_unlock(tcg_ctx.tb_ctx.tb_lock);
+qemu_mutex_unlock(tb_ctx.tb_lock);
 }
 }
 
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 944b877..baf053a 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -529,13 +529,12 @@ struct TCGContext {
 size_t code_gen_buffer_max_size;
 void *code_gen_ptr;
 
-TBContext tb_ctx;
-
 /* The TCGBackendData structure is private to tcg-target.c.  */
 struct TCGBackendData *be;
 };
 
 extern TCGContext tcg_ctx;
+extern TBContext tb_ctx;
 
 /* pool based memory allocation */
 
diff --git a/translate-all.c b/translate-all.c
index 0e11c70..e393d30 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -127,6 +127,9 @@ static void *l1_map[V_L1_SIZE];
 /* code generation context */
 TCGContext tcg_ctx;
 
+/*

[Qemu-devel] [RFC 09/10] cpu: remove exit_request global.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

This removes exit_request global and adds a variable in CPUState for this.
Only the flag for the first cpu is used for the moment as we are still with one
TCG thread.

Signed-off-by: KONRAD Frederic fred.kon...@greensocs.com
---
 cpu-exec.c|  4 +---
 cpus.c| 12 +---
 include/qom/cpu.h |  1 +
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 68654e3..3ba2a7a 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -327,8 +327,6 @@ static void cpu_handle_debug_exception(CPUArchState *env)
 
 /* main execution loop */
 
-volatile sig_atomic_t exit_request;
-
 int cpu_exec(CPUArchState *env)
 {
 CPUState *cpu = ENV_GET_CPU(env);
@@ -365,7 +363,7 @@ int cpu_exec(CPUArchState *env)
  * an instruction scheduling constraint on modern architectures.  */
 smp_mb();
 
-if (unlikely(exit_request)) {
+if (unlikely(cpu-exit_loop_request)) {
 cpu-exit_request = 1;
 }
 
diff --git a/cpus.c b/cpus.c
index f10c94d..8ae70c2 100644
--- a/cpus.c
+++ b/cpus.c
@@ -646,10 +646,14 @@ static void cpu_handle_guest_debug(CPUState *cpu)
 
 static void cpu_signal(int sig)
 {
+CPUState *cpu;
 if (current_cpu) {
 cpu_exit(current_cpu);
 }
-exit_request = 1;
+
+CPU_FOREACH(cpu) {
+cpu-exit_loop_request = 1;
+}
 }
 
 #ifdef CONFIG_LINUX
@@ -1376,7 +1380,8 @@ static void tcg_exec_all(void)
 if (next_cpu == NULL) {
 next_cpu = first_cpu;
 }
-for (; next_cpu != NULL  !exit_request; next_cpu = CPU_NEXT(next_cpu)) {
+for (; next_cpu != NULL  !first_cpu-exit_loop_request;
+   next_cpu = CPU_NEXT(next_cpu)) {
 CPUState *cpu = next_cpu;
 CPUArchState *env = cpu-env_ptr;
 
@@ -1393,7 +1398,8 @@ static void tcg_exec_all(void)
 break;
 }
 }
-exit_request = 0;
+
+first_cpu-exit_loop_request = 0;
 }
 
 void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 2098f1c..a2e3208 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -249,6 +249,7 @@ struct CPUState {
 bool created;
 bool stop;
 bool stopped;
+volatile sig_atomic_t exit_loop_request;
 volatile sig_atomic_t exit_request;
 uint32_t interrupt_request;
 int singlestep_enabled;
-- 
1.9.0

[Qemu-devel] [RFC 02/10] use a different translation block list for each cpu.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

We need a different TranslationBlock list for each core in case of multithread
TCG.

Signed-off-by: KONRAD Frederic fred.kon...@greensocs.com
---
 translate-all.c | 40 ++--
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/translate-all.c b/translate-all.c
index 8fa4378..0e11c70 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -72,10 +72,11 @@
 #endif
 
 #define SMC_BITMAP_USE_THRESHOLD 10
+#define MAX_CPUS 256
 
 typedef struct PageDesc {
 /* list of TBs intersecting this ram page */
-TranslationBlock *first_tb;
+TranslationBlock *first_tb[MAX_CPUS];
 /* in order to optimize self modifying code, we count the number
of lookups we do to a given page to use a bitmap */
 unsigned int code_write_count;
@@ -750,7 +751,7 @@ static inline void invalidate_page_bitmap(PageDesc *p)
 /* Set to NULL all the 'first_tb' fields in all PageDescs. */
 static void page_flush_tb_1(int level, void **lp)
 {
-int i;
+int i, j;
 
 if (*lp == NULL) {
 return;
@@ -759,7 +760,9 @@ static void page_flush_tb_1(int level, void **lp)
 PageDesc *pd = *lp;
 
 for (i = 0; i  V_L2_SIZE; ++i) {
-pd[i].first_tb = NULL;
+for (j = 0; j  MAX_CPUS; j++) {
+pd[i].first_tb[j] = NULL;
+}
 invalidate_page_bitmap(pd + i);
 }
 } else {
@@ -937,12 +940,12 @@ void tb_phys_invalidate(TranslationBlock *tb, 
tb_page_addr_t page_addr)
 /* remove the TB from the page list */
 if (tb-page_addr[0] != page_addr) {
 p = page_find(tb-page_addr[0]  TARGET_PAGE_BITS);
-tb_page_remove(p-first_tb, tb);
+tb_page_remove(p-first_tb[current_cpu-cpu_index], tb);
 invalidate_page_bitmap(p);
 }
 if (tb-page_addr[1] != -1  tb-page_addr[1] != page_addr) {
 p = page_find(tb-page_addr[1]  TARGET_PAGE_BITS);
-tb_page_remove(p-first_tb, tb);
+tb_page_remove(p-first_tb[current_cpu-cpu_index], tb);
 invalidate_page_bitmap(p);
 }
 
@@ -1012,7 +1015,7 @@ static void build_page_bitmap(PageDesc *p)
 
 p-code_bitmap = g_malloc0(TARGET_PAGE_SIZE / 8);
 
-tb = p-first_tb;
+tb = p-first_tb[current_cpu-cpu_index];
 while (tb != NULL) {
 n = (uintptr_t)tb  3;
 tb = (TranslationBlock *)((uintptr_t)tb  ~3);
@@ -1138,7 +1141,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, 
tb_page_addr_t end,
 /* we remove all the TBs in the range [start, end[ */
 /* XXX: see if in some cases it could be faster to invalidate all
the code */
-tb = p-first_tb;
+tb = p-first_tb[cpu-cpu_index];
 while (tb != NULL) {
 n = (uintptr_t)tb  3;
 tb = (TranslationBlock *)((uintptr_t)tb  ~3);
@@ -1196,7 +1199,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, 
tb_page_addr_t end,
 }
 #if !defined(CONFIG_USER_ONLY)
 /* if no code remaining, no need to continue to use slow writes */
-if (!p-first_tb) {
+if (!p-first_tb[cpu-cpu_index]) {
 invalidate_page_bitmap(p);
 if (is_cpu_write_access) {
 tlb_unprotect_code_phys(cpu, start, cpu-mem_io_vaddr);
@@ -1224,10 +1227,10 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, 
int len)
 #if 0
 if (1) {
 qemu_log(modifying code at 0x%x size=%d EIP=%x PC=%08x\n,
-  cpu_single_env-mem_io_vaddr, len,
-  cpu_single_env-eip,
-  cpu_single_env-eip +
-  (intptr_t)cpu_single_env-segs[R_CS].base);
+  current_cpu-mem_io_vaddr, len,
+  current_cpu-eip,
+  current_cpu-eip +
+  (intptr_t)current_cpu-segs[R_CS].base);
 }
 #endif
 p = page_find(start  TARGET_PAGE_BITS);
@@ -1269,7 +1272,7 @@ static void tb_invalidate_phys_page(tb_page_addr_t addr,
 if (!p) {
 return;
 }
-tb = p-first_tb;
+tb = p-first_tb[current_cpu-cpu_index];
 #ifdef TARGET_HAS_PRECISE_SMC
 if (tb  pc != 0) {
 current_tb = tb_find_pc(pc);
@@ -1299,7 +1302,7 @@ static void tb_invalidate_phys_page(tb_page_addr_t addr,
 tb_phys_invalidate(tb, addr);
 tb = tb-page_next[n];
 }
-p-first_tb = NULL;
+p-first_tb[current_cpu-cpu_index] = NULL;
 #ifdef TARGET_HAS_PRECISE_SMC
 if (current_tb_modified) {
 /* we generate a block containing just the instruction
@@ -1327,11 +1330,12 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 
 tb-page_addr[n] = page_addr;
 p = page_find_alloc(page_addr  TARGET_PAGE_BITS, 1);
-tb-page_next[n] = p-first_tb;
+tb-page_next[n] = p-first_tb[current_cpu-cpu_index];
 #ifndef CONFIG_USER_ONLY
-page_already_protected = p-first_tb != NULL;
+page_already_protected = p-first_tb[current_cpu-cpu_index] != NULL;
 #endif
-p-first_tb = (TranslationBlock *)((uintptr_t)tb | n);
+

[Qemu-devel] [RFC 10/10] tcg: switch on multithread.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

This switches on multithread.

Signed-off-by: KONRAD Frederic fred.kon...@greensocs.com
---
 cpus.c | 85 +++---
 1 file changed, 30 insertions(+), 55 deletions(-)

diff --git a/cpus.c b/cpus.c
index 8ae70c2..ab327dd 100644
--- a/cpus.c
+++ b/cpus.c
@@ -64,7 +64,6 @@
 
 #endif /* CONFIG_LINUX */
 
-static CPUState *next_cpu;
 int64_t max_delay;
 int64_t max_advance;
 
@@ -799,8 +798,6 @@ static bool iothread_requesting_mutex;
 
 static QemuThread io_thread;
 
-static QemuThread *tcg_cpu_thread;
-
 /* cpu creation */
 static QemuCond qemu_cpu_cond;
 /* system init */
@@ -907,10 +904,12 @@ static void qemu_wait_io_event_common(CPUState *cpu)
 
 static void qemu_tcg_wait_io_event(CPUState *cpu)
 {
-while (all_cpu_threads_idle()) {
+while (cpu_thread_is_idle(cpu)) {
/* Start accounting real time to the virtual clock if the CPUs
   are idle.  */
-qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
+if ((all_cpu_threads_idle())  (cpu-cpu_index == 0)) {
+  qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
+}
 qemu_cond_wait(cpu-halt_cond, qemu_global_mutex);
 }
 
@@ -918,9 +917,7 @@ static void qemu_tcg_wait_io_event(CPUState *cpu)
 qemu_cond_wait(qemu_io_proceeded_cond, qemu_global_mutex);
 }
 
-CPU_FOREACH(cpu) {
-qemu_wait_io_event_common(cpu);
-}
+qemu_wait_io_event_common(cpu);
 }
 
 static void qemu_kvm_wait_io_event(CPUState *cpu)
@@ -1012,7 +1009,7 @@ static void *qemu_dummy_cpu_thread_fn(void *arg)
 #endif
 }
 
-static void tcg_exec_all(void);
+static void tcg_exec_all(CPUState *cpu);
 
 static void *qemu_tcg_cpu_thread_fn(void *arg)
 {
@@ -1022,34 +1019,26 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 qemu_thread_get_self(cpu-thread);
 
 qemu_mutex_lock_iothread();
-CPU_FOREACH(cpu) {
-cpu-thread_id = qemu_get_thread_id();
-cpu-created = true;
-cpu-can_do_io = 1;
-}
-qemu_cond_signal(qemu_cpu_cond);
-
-/* wait for initial kick-off after machine start */
-while (QTAILQ_FIRST(cpus)-stopped) {
-qemu_cond_wait(QTAILQ_FIRST(cpus)-halt_cond, qemu_global_mutex);
+cpu-thread_id = qemu_get_thread_id();
+cpu-created = true;
+cpu-can_do_io = 1;
 
-/* process any pending work */
-CPU_FOREACH(cpu) {
-qemu_wait_io_event_common(cpu);
-}
-}
+qemu_cond_signal(qemu_cpu_cond);
 
 while (1) {
-tcg_exec_all();
+if (!cpu-stopped) {
+tcg_exec_all(cpu);
 
-if (use_icount) {
-int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
+if (use_icount) {
+int64_t deadline =
+qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
 
-if (deadline == 0) {
-qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
+if (deadline == 0) {
+qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
+}
 }
 }
-qemu_tcg_wait_io_event(QTAILQ_FIRST(cpus));
+qemu_tcg_wait_io_event(cpu);
 }
 
 return NULL;
@@ -1207,23 +1196,15 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
 
 cpu-halt_cond = g_malloc0(sizeof(QemuCond));
 qemu_cond_init(cpu-halt_cond);
-
-/* share a single thread for all cpus with TCG */
-if (!tcg_cpu_thread) {
-cpu-thread = g_malloc0(sizeof(QemuThread));
-snprintf(thread_name, VCPU_THREAD_NAME_SIZE, CPU %d/TCG,
- cpu-cpu_index);
-qemu_thread_create(cpu-thread, thread_name, qemu_tcg_cpu_thread_fn,
-   cpu, QEMU_THREAD_JOINABLE);
+cpu-thread = g_malloc0(sizeof(QemuThread));
+snprintf(thread_name, VCPU_THREAD_NAME_SIZE, CPU %d/TCG, cpu-cpu_index);
+qemu_thread_create(cpu-thread, thread_name, qemu_tcg_cpu_thread_fn, cpu,
+   QEMU_THREAD_JOINABLE);
 #ifdef _WIN32
-cpu-hThread = qemu_thread_get_handle(cpu-thread);
+cpu-hThread = qemu_thread_get_handle(cpu-thread);
 #endif
-while (!cpu-created) {
-qemu_cond_wait(qemu_cpu_cond, qemu_global_mutex);
-}
-tcg_cpu_thread = cpu-thread;
-} else {
-cpu-thread = tcg_cpu_thread;
+while (!cpu-created) {
+qemu_cond_wait(qemu_cpu_cond, qemu_global_mutex);
 }
 }
 
@@ -1370,21 +1351,15 @@ static int tcg_cpu_exec(CPUArchState *env)
 return ret;
 }
 
-static void tcg_exec_all(void)
+static void tcg_exec_all(CPUState *cpu)
 {
 int r;
+CPUArchState *env = cpu-env_ptr;
 
 /* Account partial waits to QEMU_CLOCK_VIRTUAL.  */
 qemu_clock_warp(QEMU_CLOCK_VIRTUAL);
 
-if (next_cpu == NULL) {
-next_cpu = first_cpu;
-}
-for (; next_cpu != NULL  !first_cpu-exit_loop_request;
-   next_cpu = CPU_NEXT(next_cpu)) {
-CPUState *cpu = next_cpu;
-CPUArchState *env = cpu-env_ptr;
-
+while

[Qemu-devel] [RFC 00/10] MultiThread TCG.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

Hi everybody,

This is the start of our work on TCG multithread.

We send it for comment to be sure we are taking the right direction.
We already discussed the first patch but we keep it for simplicity.

We choice to keep a common tbs array for all VCPU but protect it with the
tb_lock from TBContext. Then for each PageDesc we have a tb list per VCPU.

Known issues:
  * Some random deadlock.
  * qemu_pause_cond is broken we can't quit QEMU.
  * tb_flush is broken we must make sure no VCPU are executing code.

Jan Kiszka (1):
  Drop global lock during TCG code execution

KONRAD Frederic (9):
  target-arm: protect cpu_exclusive_*.
  use a different translation block list for each cpu.
  replace spinlock by QemuMutex.
  remove unused spinlock.
  extract TBContext from TCGContext.
  protect TBContext with tb_lock.
  tcg: remove tcg_halt_cond global variable.
  cpu: remove exit_request global.
  tcg: switch on multithread.

 cpu-exec.c|  47 +++
 cpus.c| 122 +++
 cputlb.c  |   5 ++
 exec.c|  25 ++
 include/exec/exec-all.h   |   4 +-
 include/exec/spinlock.h   |  49 ---
 include/qom/cpu.h |   1 +
 linux-user/main.c |   6 +-
 scripts/checkpatch.pl |   9 +-
 softmmu_template.h|   6 ++
 target-arm/cpu.c  |  14 
 target-arm/cpu.h  |   3 +
 target-arm/helper.h   |   3 +
 target-arm/op_helper.c|  10 +++
 target-arm/translate.c|   6 ++
 target-i386/mem_helper.c  |  16 +++-
 target-i386/misc_helper.c |  27 +-
 tcg/i386/tcg-target.c |   8 ++
 tcg/tcg.h |   3 +-
 translate-all.c   | 208 +++---
 vl.c  |   6 ++
 21 files changed, 335 insertions(+), 243 deletions(-)
 delete mode 100644 include/exec/spinlock.h

-- 
1.9.0

[Qemu-devel] [RFC 06/10] protect TBContext with tb_lock.

2015-01-16 Thread fred . konrad

From: KONRAD Frederic fred.kon...@greensocs.com

This protects TBContext with tb_lock to make tb_* thread safe.

We can still have issue with tb_flush in case of multithread TCG:
  An other CPU can be executing code during a flush.

This can be fixed later by making all other TCG thread exiting before calling
tb_flush().

Signed-off-by: KONRAD Frederic fred.kon...@greensocs.com
---
 cpu-exec.c  |  16 -
 translate-all.c | 100 +---
 2 files changed, 82 insertions(+), 34 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 4d22252..68654e3 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -237,6 +237,7 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
 tb_page_addr_t phys_pc, phys_page1;
 target_ulong virt_page2;
 
+qemu_mutex_lock(tb_ctx.tb_lock);
 tb_ctx.tb_invalidated_flag = 0;
 
 /* find translated block using physical mappings */
@@ -268,8 +269,14 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
 ptb1 = tb-phys_hash_next;
 }
  not_found:
+   /*
+* FIXME: We need to release this mutex because tb_gen_code needs it.
+* This can be optimised by adding a flag to tb_gen_code?
+*/
+   qemu_mutex_unlock(tb_ctx.tb_lock);
/* if no translated code available, then translate it now */
-tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+   tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+   qemu_mutex_lock(tb_ctx.tb_lock);
 
  found:
 /* Move the last found TB to the head of the list */
@@ -280,6 +287,7 @@ static TranslationBlock *tb_find_slow(CPUArchState *env,
 }
 /* we add the TB in the virtual pc hash table */
 cpu-tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
+qemu_mutex_unlock(tb_ctx.tb_lock);
 return tb;
 }
 
@@ -460,6 +468,9 @@ int cpu_exec(CPUArchState *env)
 tb = tb_find_fast(env);
 /* Note: we do it here to avoid a gcc bug on Mac OS X when
doing it in tb_find_slow */
+#if !defined(CONFIG_USER_ONLY)
+qemu_mutex_lock(tb_ctx.tb_lock);
+#endif
 if (tb_ctx.tb_invalidated_flag) {
 /* as some TB could have been invalidated because
of memory exceptions while generating the code, we
@@ -467,6 +478,9 @@ int cpu_exec(CPUArchState *env)
 next_tb = 0;
 tb_ctx.tb_invalidated_flag = 0;
 }
+#if !defined(CONFIG_USER_ONLY)
+qemu_mutex_unlock(tb_ctx.tb_lock);
+#endif
 if (qemu_loglevel_mask(CPU_LOG_EXEC)) {
 qemu_log(Trace %p [ TARGET_FMT_lx ] %s\n,
  tb-tc_ptr, tb-pc, lookup_symbol(tb-pc));
diff --git a/translate-all.c b/translate-all.c
index e393d30..68505c0 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -689,6 +689,7 @@ static inline void code_gen_alloc(size_t tb_size)
 CODE_GEN_AVG_BLOCK_SIZE;
 tb_ctx.tbs = g_malloc(tcg_ctx.code_gen_max_blocks
   * sizeof(TranslationBlock));
+qemu_mutex_init(tb_ctx.tb_lock);
 }
 
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
@@ -713,20 +714,23 @@ bool tcg_enabled(void)
 return tcg_ctx.code_gen_buffer != NULL;
 }
 
-/* Allocate a new translation block. Flush the translation buffer if
-   too many translation blocks or too much generated code. */
+/*
+ * Allocate a new translation block. Flush the translation buffer if
+ * too many translation blocks or too much generated code.
+ * tb_alloc is not thread safe but tb_gen_code is protected by a mutex so this
+ * function is called only by one thread.
+ */
 static TranslationBlock *tb_alloc(target_ulong pc)
 {
-TranslationBlock *tb;
+TranslationBlock *tb = NULL;
 
-if (tb_ctx.nb_tbs = tcg_ctx.code_gen_max_blocks ||
-(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) =
+if (tb_ctx.nb_tbs  tcg_ctx.code_gen_max_blocks 
+(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) 
  tcg_ctx.code_gen_buffer_max_size) {
-return NULL;
+tb = tb_ctx.tbs[tb_ctx.nb_tbs++];
+tb-pc = pc;
+tb-cflags = 0;
 }
-tb = tb_ctx.tbs[tb_ctx.nb_tbs++];
-tb-pc = pc;
-tb-cflags = 0;
 return tb;
 }
 
@@ -735,11 +739,16 @@ void tb_free(TranslationBlock *tb)
 /* In practice this is mostly used for single use temporary TB
Ignore the hard cases and just back up if this TB happens to
be the last one generated.  */
+
+qemu_mutex_lock(tb_ctx.tb_lock);
+
 if (tb_ctx.nb_tbs  0 
 tb == tb_ctx.tbs[tb_ctx.nb_tbs - 1]) {
 tcg_ctx.code_gen_ptr = tb-tc_ptr;
 tb_ctx.nb_tbs--;
 }
+
+qemu_mutex_unlock(tb_ctx.tb_lock);
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -792,6 +801,8 @@ void tb_flush(CPUArchState *env1)
 {
 CPUState *cpu = ENV_GET_CPU(env1);
 
+qemu_mutex_lock(tb_ctx.tb_lock);
+
 #if defined(DEBUG_FLUSH)
 printf(qemu:

Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery

2015-01-16 Thread John Snow




On 12/09/2014 01:15 PM, Dr. David Alan Gilbert (git) wrote:

From: Dr. David Alan Gilbert dgilb...@redhat.com

(With the previous atapi_dma flag recovery)
If migration happens between the ATAPI command being written and the
bmdma being started, the DMA is dropped.  Eventually the guest times
out and recovers, but that can take many seconds.
(This is rare, on a pingpong reading the CD continuously I hit
this about ~1/30-1/50 migrates)

I don't think we've got enough state to be able to recover safely
at this point, so I throw a 'medium error, no seek complete'
that I'm assuming guests will try and recover from an apparently
dirty CD.

OK, it's a hack, the real solution is probably to push a lot of
ATAPI state into the migration stream, but this is a fix that
works with no stream changes. Tested only on Linux (both RHEL5
(pre-libata) and RHEL7).

Signed-off-by: Dr. David Alan Gilbert dgilb...@redhat.com
---
  hw/ide/atapi.c| 17 +
  hw/ide/internal.h |  2 ++
  hw/ide/pci.c  | 11 +++
  3 files changed, 30 insertions(+)

diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
index c63b7e5..e17799c 100644
--- a/hw/ide/atapi.c
+++ b/hw/ide/atapi.c
@@ -394,6 +394,23 @@ static void ide_atapi_cmd_read(IDEState *s, int lba, int 
nb_sectors,
  }
  }

+
+/* Called by *_restart_bh when the transfer function points
+ * to ide_atapi_cmd
+ */
+void ide_atapi_dma_restart(IDEState *s)
+{
+/*
+ * I'm not sure we have enough stored to restart the command
+ * safely, so give the guest an error it should recover from.
+ * I'm assuming most guests will try to recover from something
+ * listed as a medium error on a CD; it seems to work on Linux.
+ * This would be more of a problem if we did any other type of
+ * DMA operation.
+ */
+ide_atapi_cmd_error(s, MEDIUM_ERROR, ASC_NO_SEEK_COMPLETE);
+}
+
  static inline uint8_t ide_atapi_set_profile(uint8_t *buf, uint8_t *index,
  uint16_t profile)
  {
diff --git a/hw/ide/internal.h b/hw/ide/internal.h
index 8a3eca4..8b65285 100644
--- a/hw/ide/internal.h
+++ b/hw/ide/internal.h
@@ -289,6 +289,7 @@ typedef struct IDEDMAOps IDEDMAOps;
  #define ATAPI_INT_REASON_TAG0xf8

  /* same constants as bochs */
+#define ASC_NO_SEEK_COMPLETE 0x02
  #define ASC_ILLEGAL_OPCODE   0x20
  #define ASC_LOGICAL_BLOCK_OOR0x21
  #define ASC_INV_FIELD_IN_CMD_PACKET  0x24
@@ -529,6 +530,7 @@ void ide_dma_error(IDEState *s);

  void ide_atapi_cmd_ok(IDEState *s);
  void ide_atapi_cmd_error(IDEState *s, int sense_key, int asc);
+void ide_atapi_dma_restart(IDEState *s);
  void ide_atapi_io_error(IDEState *s, int ret);

  void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val);
diff --git a/hw/ide/pci.c b/hw/ide/pci.c
index bee5ad3..e3f2054 100644
--- a/hw/ide/pci.c
+++ b/hw/ide/pci.c
@@ -235,6 +235,17 @@ static void bmdma_restart_bh(void *opaque)
  }
  } else if (error_status  IDE_RETRY_FLUSH) {
  ide_flush_cache(bmdma_active_if(bm));
+} else {
+IDEState *s = bmdma_active_if(bm);
+
+/*
+ * We've not got any bits to tell us about ATAPI - but
+ * we do have the end_transfer_func that tells us what
+ * we're trying to do.
+ */
+if (s-end_transfer_func == ide_atapi_cmd) {
+ide_atapi_dma_restart(s);
+}
  }
  }




I think this workaround is sensible until we develop a more precise fix.

Reviewed-by: John Snow js...@redhat.com

Re: [Qemu-devel] [PATCH v11 09/13] qmp: Add support of dirty-bitmap sync mode for drive-backup

2015-01-16 Thread Max Reitz


On 2015-01-12 at 11:31, John Snow wrote:

For dirty-bitmap sync mode, the block job will iterate through the
given dirty bitmap to decide if a sector needs backup (backup all the
dirty clusters and skip clean ones), just as allocation conditions of
top sync mode.

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   |   5 ++
  block/backup.c| 120 ++
  block/mirror.c|   4 ++
  blockdev.c|  14 +-
  hmp.c |   3 +-
  include/block/block.h |   1 +
  include/block/block_int.h |   2 +
  qapi/block-core.json  |  11 +++--
  qmp-commands.hx   |   7 +--
  9 files changed, 137 insertions(+), 30 deletions(-)


Since you seem to be intending to rethink the frozen state, I'm just 
scanning through the series from patch 8 on.


While this patch doesn't seem to have changed much conceptually since 
the last version I reviewed, with it applied to master, qemu fails to 
build due to Fam's blockdev-backup series (some new calls to 
backup_start() which have to be adapted).


Max

Re: [Qemu-devel] [PATCH v11 09/13] qmp: Add support of dirty-bitmap sync mode for drive-backup

2015-01-16 Thread John Snow




On 01/16/2015 12:52 PM, Max Reitz wrote:

On 2015-01-12 at 11:31, John Snow wrote:

For dirty-bitmap sync mode, the block job will iterate through the
given dirty bitmap to decide if a sector needs backup (backup all the
dirty clusters and skip clean ones), just as allocation conditions of
top sync mode.

Signed-off-by: Fam Zheng f...@redhat.com
Signed-off-by: John Snow js...@redhat.com
---
  block.c   |   5 ++
  block/backup.c| 120
++
  block/mirror.c|   4 ++
  blockdev.c|  14 +-
  hmp.c |   3 +-
  include/block/block.h |   1 +
  include/block/block_int.h |   2 +
  qapi/block-core.json  |  11 +++--
  qmp-commands.hx   |   7 +--
  9 files changed, 137 insertions(+), 30 deletions(-)


Since you seem to be intending to rethink the frozen state, I'm just
scanning through the series from patch 8 on.

While this patch doesn't seem to have changed much conceptually since
the last version I reviewed, with it applied to master, qemu fails to
build due to Fam's blockdev-backup series (some new calls to
backup_start() which have to be adapted).

Max


Understood. Frozen will still exist largely similar to what it doe s 
now, but the proposal from fam was to simply fold enabled/disabled 
into the same status, and perhaps differentiate between a simple 
disabled state and a frozen state by the presence or absence of a 
successor.


As for the rebase onto master, here's a v11.5:
https://github.com/jnsnow/qemu/commits/dbm-backup

--
—js

[Qemu-devel] [PATCH] cpu_ldst.h: Don't define helpers if MMU_MODE*_SUFFIX not defined

2015-01-16 Thread Peter Maydell

Not all targets define a full set of suffix strings for the
NB_MMU_MODES that they have. In this situation, don't define any
helper functions for that mode, rather than defining helper functions
with no suffix at all. The MMU mode is still functional; it is merely
not directly accessible via cpu_ld*_MODE from target helper functions.

Also add an NB_MMU_MODES = 2 check to the definition of the mode 1
helpers -- some targets only define one MMU mode.

Signed-off-by: Peter Maydell peter.mayd...@linaro.org
---
The alternative would be to put in guards requiring the suffix
macros to be defined, and fix up the targets which are missing them:
 arm (no 2 or 3)
 lm32 (no 0)
 moxie (no 0)
 openrisc (no 0 1 or 2)
 tricore (no 0 1 or 2)

I'm going to fix up ARM anyway, but I decided that just deal with
it was better than trying to look up MMU mode definitions for four
CPU types I don't know...

This patch applies after the 15-patch cpu_ldst series I sent out
the other day.

 include/exec/cpu_ldst.h | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h
index d98ff17..0e825ea 100644
--- a/include/exec/cpu_ldst.h
+++ b/include/exec/cpu_ldst.h
@@ -132,6 +132,7 @@ uint16_t helper_ldw_cmmu(CPUArchState *env, target_ulong 
addr, int mmu_idx);
 uint32_t helper_ldl_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
 uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
 
+#ifdef MMU_MODE0_SUFFIX
 #define CPU_MMU_INDEX 0
 #define MEMSUFFIX MMU_MODE0_SUFFIX
 #define DATA_SIZE 1
@@ -147,7 +148,9 @@ uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong 
addr, int mmu_idx);
 #include exec/cpu_ldst_template.h
 #undef CPU_MMU_INDEX
 #undef MEMSUFFIX
+#endif
 
+#if (NB_MMU_MODES = 2)  defined(MMU_MODE1_SUFFIX)
 #define CPU_MMU_INDEX 1
 #define MEMSUFFIX MMU_MODE1_SUFFIX
 #define DATA_SIZE 1
@@ -163,8 +166,9 @@ uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong 
addr, int mmu_idx);
 #include exec/cpu_ldst_template.h
 #undef CPU_MMU_INDEX
 #undef MEMSUFFIX
+#endif
 
-#if (NB_MMU_MODES = 3)
+#if (NB_MMU_MODES = 3)  defined(MMU_MODE2_SUFFIX)
 
 #define CPU_MMU_INDEX 2
 #define MEMSUFFIX MMU_MODE2_SUFFIX
@@ -183,7 +187,7 @@ uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong 
addr, int mmu_idx);
 #undef MEMSUFFIX
 #endif /* (NB_MMU_MODES = 3) */
 
-#if (NB_MMU_MODES = 4)
+#if (NB_MMU_MODES = 4)  defined(MMU_MODE3_SUFFIX)
 
 #define CPU_MMU_INDEX 3
 #define MEMSUFFIX MMU_MODE3_SUFFIX
@@ -202,7 +206,7 @@ uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong 
addr, int mmu_idx);
 #undef MEMSUFFIX
 #endif /* (NB_MMU_MODES = 4) */
 
-#if (NB_MMU_MODES = 5)
+#if (NB_MMU_MODES = 5)  defined(MMU_MODE4_SUFFIX)
 
 #define CPU_MMU_INDEX 4
 #define MEMSUFFIX MMU_MODE4_SUFFIX
@@ -221,7 +225,7 @@ uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong 
addr, int mmu_idx);
 #undef MEMSUFFIX
 #endif /* (NB_MMU_MODES = 5) */
 
-#if (NB_MMU_MODES = 6)
+#if (NB_MMU_MODES = 6)  defined(MMU_MODE5_SUFFIX)
 
 #define CPU_MMU_INDEX 5
 #define MEMSUFFIX MMU_MODE5_SUFFIX
-- 
1.9.1

Re: [Qemu-devel] [PATCH v11 08/13] block: Add bitmap successors

2015-01-16 Thread John Snow




On 01/13/2015 04:24 AM, Fam Zheng wrote:

On Mon, 01/12 11:31, John Snow wrote:

A bitmap successor is an anonymous BdrvDirtyBitmap that is intended to
be created just prior to a sensitive operation (e.g. Incremental Backup)
that can either succeed or fail, but during the course of which we still
want a bitmap tracking writes.

On creating a successor, we freeze the parent bitmap which prevents
its deletion, enabling, anonymization, or creating a bitmap with the
same name.

On success, the parent bitmap can abdicate responsibility to the
successor, which will inherit its name. The successor will have been
tracking writes during the course of the backup operation. The parent
will be safely deleted.

On failure, we can reclaim the successor from the parent, unifying
them such that the resulting bitmap describes all writes occurring since
the last successful backup, for instance. Reclamation will thaw the
parent, but not explicitly re-enable it.


If we compare this with image snapshot and overlay, it fits the current backing
chain model very well.  Given that we're on the way to persistent dirty bitmap,
which will probably be read/written with general block.c code, I'm wondering if
it will be any better to reuse the block layer backing file code to handle
successor logic?

Also with the frozen feature built in dirty bitmaps, is it okay to drop
enabled state?

I think there are three states for a bitmap:

1) Active, no successor (similar to an read-write top image)
2) Frozen, no successor (similar to an read-only top image)
3) Frozen, with successor (similar to an read-only backing file, with an
overlap)

Admittedly, more code is needed than this patch, in order to glue hbitmap and
block layer together, but it would probably make live migration of dirty bitmap
very easy, but I'm not sure without a closer look.



Signed-off-by: John Snow js...@redhat.com
---
  block.c   | 123 --
  blockdev.c|  14 ++
  include/block/block.h |  10 
  3 files changed, 144 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index bd4b449..3f33b9d 100644
--- a/block.c
+++ b/block.c
@@ -53,10 +53,12 @@

  struct BdrvDirtyBitmap {
  HBitmap *bitmap;
+BdrvDirtyBitmap *successor;
  int64_t size;
  int64_t granularity;
  char *name;
  bool enabled;
+bool frozen;
  QLIST_ENTRY(BdrvDirtyBitmap) list;
  };

@@ -5342,6 +5344,7 @@ BdrvDirtyBitmap *bdrv_find_dirty_bitmap(BlockDriverState 
*bs, const char *name)

  void bdrv_dirty_bitmap_make_anon(BlockDriverState *bs, BdrvDirtyBitmap 
*bitmap)
  {
+assert(!bitmap-frozen);
  g_free(bitmap-name);
  bitmap-name = NULL;
  }
@@ -5379,9 +5382,114 @@ BdrvDirtyBitmap 
*bdrv_create_dirty_bitmap(BlockDriverState *bs,
  return bitmap;
  }

+/**
+  * A frozen bitmap cannot be renamed, deleted, cleared,
+  * or set. A frozen bitmap can only abdicate, reclaim,
+  * or thaw.
+  */
+static void bdrv_freeze_dirty_bitmap(BdrvDirtyBitmap *bitmap)
+{
+bitmap-frozen = true;
+}
+
+static void bdrv_thaw_dirty_bitmap(BdrvDirtyBitmap *bitmap)
+{
+bitmap-frozen = false;
+}
+
+bool bdrv_dirty_bitmap_frozen(BdrvDirtyBitmap *bitmap)
+{
+return bitmap-frozen;
+}
+
  bool bdrv_dirty_bitmap_enabled(BdrvDirtyBitmap *bitmap)
  {
-return bitmap-enabled;
+return bitmap-enabled  !bitmap-frozen;


An indication that enabled could be replaced by frozen. Otherwise it looks
confusing to me.


+}
+
+/**
+ * Create a successor bitmap destined to replace this bitmap after an 
operation.
+ * Requires that the bitmap is not frozen and has no successor.
+ */
+int bdrv_dirty_bitmap_create_successor(BlockDriverState *bs,
+   BdrvDirtyBitmap *bitmap, Error **errp)
+{
+uint64_t granularity;
+
+if (bdrv_dirty_bitmap_frozen(bitmap)) {
+error_setg(errp,
+   Cannot create a successor for a bitmap currently in-use.);
+return -1;
+} else if (bitmap-successor) {
+error_setg(errp, Cannot create a successor for a bitmap that 
+   already has one.);
+return -1;
+}
+
+/* Create an anonymous successor */
+granularity = bdrv_dirty_bitmap_granularity(bs, bitmap);
+bitmap-successor = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
+if (!bitmap-successor) {
+return -1;
+}
+
+/* Successor will be on or off based on our current state.
+ * Parent will be disabled and frozen. */
+bitmap-successor-enabled = bitmap-enabled;
+bdrv_disable_dirty_bitmap(bitmap);
+bdrv_freeze_dirty_bitmap(bitmap);
+return 0;
+}
+
+/**
+ * For a bitmap with a successor, yield our name to the successor,
+ * Delete the old bitmap, and return a handle to the new bitmap.
+ */
+BdrvDirtyBitmap *bdrv_dirty_bitmap_abdicate(BlockDriverState *bs,
+BdrvDirtyBitmap *bitmap,
+

Re: [Qemu-devel] [PATCH v7 05/32] target-arm: make arm_current_el() return EL3

2015-01-16 Thread Peter Maydell

On 21 October 2014 at 17:55, Greg Bellows greg.bell...@linaro.org wrote:
 From: Fabian Aggeler aggel...@ethz.ch

 Make arm_current_el() return EL3 for secure PL1 and monitor mode.
 Increase MMU modes since mmu_index is directly inferred from arm_
 current_el(). Change assertion in arm_el_is_aa64() to allow EL3.

 -#define NB_MMU_MODES 2
 +#define NB_MMU_MODES 4

So this turns out not to quite be what we want.
A QEMU MMU mode index basically defines a (vaddr - paddr,permissions)
mapping. This is similar to the ARM ARM concept of a translation
regime, with the differences that:
 * the ARM ARM translation regimes may have split permissions,
   for user and privileged code, so we need two mmu_idx values
   for a translation regime that applies to both EL0 and EL1
 * stage 1 and stage 2 translations for a VA-IPA-PA lookup
   for an EL1/EL0 hypervisor guest are two different translation
   regimes, but for QEMU we can just cache the whole VA-PA
   and use a single mmu_idx. [We only need to separately do
   VA-IPA and IPA-VA for the do this address translation
   system instructions, which don't need to touch the TLB;
   a combined stage1+stage2 TLB is permitted by the architecture.]

The translation regimes are:

If EL3 is 64-bit:
 * Secure EL3
 * Secure EL1  EL0
 * NonSecure EL2
 * NonSecure EL1  0 stage 1
 * NonSecure EL1  0 stage 2
If EL3 is 32-bit:
 * Secure PL0  PL1
 * NonSecure PL2
 * NonSecure PL1  0 stage 1
 * NonSecure PL1  0 stage 2
(reminder: for 32 bit EL3, Secure PL1 is *EL3*, not EL1.)

which we can give the following mmu indexes:

64 bit EL3:
 0 : NS EL0 stage 1+2
 1 : NS EL1 stage 1+2
 2 : NS EL2
 3 : S EL3
 4 : S EL0
 5 : S EL1

32 bit EL3:
 0 : NS EL0 (aka NS PL0) stage 1+2
 1 : NS EL1 (aka NS PL1) stage 1+2
 2 : NS EL2 (aka NS PL2)
 3 : S EL3 (aka S PL1)
 4 : S EL0 (aka S PL0)

Notice how they end up being the same, except that with a
64 bit EL3 we need an extra mmu index that 32 bit doesn't have.
They aren't simply what is our current EL?, though as you
can see I've put them in an order that comes close.

So the right answer for NB_MMU_MODES is 6 :-)

-- PMM

Re: [Qemu-devel] [PATCH] cpu_ldst.h: Don't define helpers if MMU_MODE*_SUFFIX not defined

2015-01-16 Thread Richard Henderson

On 01/16/2015 10:13 AM, Peter Maydell wrote:
 Not all targets define a full set of suffix strings for the
 NB_MMU_MODES that they have. In this situation, don't define any
 helper functions for that mode, rather than defining helper functions
 with no suffix at all. The MMU mode is still functional; it is merely
 not directly accessible via cpu_ld*_MODE from target helper functions.
 
 Also add an NB_MMU_MODES = 2 check to the definition of the mode 1
 helpers -- some targets only define one MMU mode.
 
 Signed-off-by: Peter Maydell peter.mayd...@linaro.org

Reviewed-by: Richard Henderson r...@twiddle.net


r~

Re: [Qemu-devel] [PATCH 00/15] Clean up cpu-ldst ld/st memory accessors

2015-01-16 Thread Richard Henderson

On 01/15/2015 07:01 AM, Peter Maydell wrote:
 I was looking at our confusing mess of memory accessor functions,
 and I realised that partly it was confusing because we have a
 bunch of unnecessary junk lurking in there :-) This series
 attempts to clean things up by removing things we weren't using
 at all or were only using by mistake in a few places:
 
  * ldul_*: not used
  * ld* (ldl, etc): hardly used
  * ld*_kernel: not used
  * ld*_raw: hardly used
  * cpu_{ld,st}{fq,fl}: not used
 
 The dull parts of this series are removing the unused macros
 and fixing uses of the hardly-used macros so those can be
 deleted too. This series also switches to using inline functions
 rather than macros for the user-only cpu_ld/st* accessors, bringing
 them into line with the softmmu configs. This has the nice
 side effect of letting us get rid of the _raw accessor macros too.
 I've also thrown in a commit which cleans up the doc comments.
 
 Peter Maydell (15):
   cpu_ldst.h: Remove unused ldul_ macros
   monitor.c: Use ld*_p() instead of ld*_raw()
   target-sparc: Don't use {ld,st}*_raw functions
   linux-user/elfload.c: Don't use _raw accessor functions
   bsd-user/elfload.c: Don't use ldl() or ldq_raw()
   linux-user/vm86.c: Use cpu_ldl_data c rather than plain ldl c
   linux-user/main.c (m68k): Use get_user_u16 rather than lduw in
 cpu_loop
   target-mips: Don't use _raw load/store accessors
   cpu_ldst.h: Drop unused ld/st*_kernel defines
   cpu_ldst.h: Remove unused very short ld*/st* defines
   cpu_ldst.h: Use inline functions for usermode cpu_ld/st accessors
   cpu_ldst_template.h: Use ld*_p directly rather than via ld*_raw macros
   cpu_ldst.h: Drop unused _raw macros, saddr() and laddr()
   cpu_ldst_template.h: Drop unused cpu_ldfq/stfq/ldfl/stfl accessors
   cpu_ldst.h, cpu-all.h, bswap.h: Update documentation on ld/st
 accessors

Reviewed-by: Richard Henderson r...@twiddle.net

Nice cleanup.

r~

Re: [Qemu-devel] [PATCH v2] qemu-iotests: Fix supported_oses check

2015-01-16 Thread John Snow




On 01/16/2015 08:23 AM, Stefan Hajnoczi wrote:

On Fri, Jan 16, 2015 at 09:38:42AM +0800, Fam Zheng wrote:

There is a bug in the recently added sys.platform test, and we no longer
run python tests, because linux2 is the value to compare here. So do a
prefix match. According to python doc [1], the way to use sys.platform
is unless you want to test for a specific system version, it is
therefore recommended to use the following idiom:

if sys.platform.startswith('freebsd'):
 # FreeBSD-specific code here...
elif sys.platform.startswith('linux'):
 # Linux-specific code here...

[1]: https://docs.python.org/2.7/library/sys.html#sys.platform

Signed-off-by: Fam Zheng f...@redhat.com

---
v2: Don't use any().
 Explain why prefix match is fine.
 (Thanks, Stefan)
---
  tests/qemu-iotests/iotests.py | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan



Thanks for this!

fake_internet_karma++

1 2 >

1 - 100 of 122 matches

Mail list logo