[PATCH] V2 Handle guest access to BBL_CR_CTL3 MSR

2011-01-20 Thread john cooper
[Resubmit of prior version which contained a wayward
patch hunk.  Thanks Marcelo]

A correction to Intel cpu model CPUID data (patch queued)
caused winxp to BSOD when booted with a Penryn model.
This was traced to the CPUID model field correction from
6 - 23 (as is proper for a Penryn class of cpu).  Only in
this case does the problem surface.

The cause for this failure is winxp accessing the BBL_CR_CTL3
MSR which is unsupported by current kvm, appears to be a
legacy MSR not fully characterized yet existing in current
silicon, and is apparently carried forward in MSR space to
accommodate vintage code as here.  It is not yet conclusive
whether this MSR implements any of its legacy functionality
or is just an ornamental dud for compatibility.  While I
found no silicon version specific documentation link to
this MSR, a general description exists in Intel's developer's
reference which agrees with the functional behavior of
other bootloader/kernel code I've examined accessing
BBL_CR_CTL3.  Regrettably winxp appears to be setting bit #19
called out as reserved in the above document.

So to minimally accommodate this MSR, kvm msr get will provide
the equivalent mock data and kvm msr write will simply toss the
guest passed data without interpretation.  While this treatment
of BBL_CR_CTL3 addresses the immediate problem, the approach may
be modified pending clarification from Intel.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4d0dfa0..5bfafb6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -38,6 +38,7 @@
 
 #define MSR_MTRRcap0x00fe
 #define MSR_IA32_BBL_CR_CTL0x0119
+#define MSR_IA32_BBL_CR_CTL3   0x011e
 
 #define MSR_IA32_SYSENTER_CS   0x0174
 #define MSR_IA32_SYSENTER_ESP  0x0175
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcc0efc..04d6c55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1592,6 +1592,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
} else
return set_msr_hyperv(vcpu, msr, data);
break;
+   case MSR_IA32_BBL_CR_CTL3:
+   /* Drop writes to this legacy MSR -- see rdmsr
+* counterpart for further detail.
+*/
+   pr_unimpl(vcpu, ignored wrmsr: 0x%x data %llx\n, msr, data);
+   break;
default:
if (msr  (msr == vcpu-kvm-arch.xen_hvm_config.msr))
return xen_hvm_config(vcpu, data);
@@ -1846,6 +1852,19 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 *pdata)
} else
return get_msr_hyperv(vcpu, msr, pdata);
break;
+   case MSR_IA32_BBL_CR_CTL3:
+   /* This legacy MSR exists but isn't fully documented in current
+* silicon.  It is however accessed by winxp in very narrow
+* scenarios where it sets bit #19, itself documented as
+* a reserved bit.  Best effort attempt to source coherent
+* read data here should the balance of the register be
+* interpreted by the guest:
+*
+* L2 cache control register 3: 64GB range, 256KB size,
+* enabled, latency 0x1, configured
+*/ 
+   data = 0xbe702111;
+   break;
default:
if (!ignore_msrs) {
pr_unimpl(vcpu, unhandled rdmsr: 0x%x\n, msr);


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Handle guest access to BBL_CR_CTL3 MSR

2011-01-18 Thread john cooper
Marcelo Tosatti wrote:
 On Sat, Jan 08, 2011 at 12:05:14AM -0500, john cooper wrote:
   
 A correction to Intel cpu model CPUID data (patch queued)
 caused winxp-64 to BSOD when booted with a Penryn model.
 This was traced to the CPUID model field correction from
 6 - 23 (as is proper for a Penryn class of cpu).  Only in
 this case does the problem surface.

 The cause for this failure is winxp accessing the BBL_CR_CTL3
 MSR which is unsupported by current kvm, appears to be a
 legacy MSR not fully characterized yet existing in current
 silicon, and is apparently carried forward in MSR space to
 accommodate vintage code as here.  It is not yet conclusive
 whether this MSR implements any of its legacy functionality
 or is just an ornamental dud for compatibility.  While I
 found no silicon version specific documentation link to
 this MSR, a general description exists in Intel's developer's
 reference which agrees with the functional behavior of
 other bootloader/kernel code I've examined accessing
 BBL_CR_CTL3.  Regrettably winxp-64 appears to be setting bit #19
 called out as reserved in the above document.

 So to minimally accommodate this MSR, kvm msr get will provide
 the equivalent mock data and kvm msr write will simply toss the
 guest passed data without interpretation.  While this treatment
 of BBL_CR_CTL3 addresses the immediate problem, the approach may
 be modified pending clarification from Intel.

 Signed-off-by: john cooper john.coo...@redhat.com
 ---

 diff --git a/arch/x86/include/asm/msr-index.h 
 b/arch/x86/include/asm/msr-index.h
 index 6b89f5e..145cd60 100644
 --- a/arch/x86/include/asm/msr-index.h
 +++ b/arch/x86/include/asm/msr-index.h
 @@ -38,6 +38,7 @@
  
  #define MSR_MTRRcap 0x00fe
  #define MSR_IA32_BBL_CR_CTL 0x0119
 +#define MSR_IA32_BBL_CR_CTL30x011e
  
  #define MSR_IA32_SYSENTER_CS0x0174
  #define MSR_IA32_SYSENTER_ESP   0x0175
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index fa708c9..9a8331c 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1283,6 +1283,12 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 
 msr, u64 data)
  return -1;
  vcpu-arch.mcg_ctl = data;
  break;
 +case MSR_IA32_BBL_CR_CTL3:
 +/* Drop writes to this legacy MSR -- see rdmsr
 + * counterpart for further detail.
 + */
 +pr_unimpl(vcpu, ignored wrmsr: 0x%x data %llx\n, msr, data);
 +break;
  default:
  if (msr = MSR_IA32_MC0_CTL 
  msr  MSR_IA32_MC0_CTL + 4 * bank_num) {
 @@ -1592,6 +1598,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 
 msr, u64 data)
  } else
  return set_msr_hyperv(vcpu, msr, data);
  break;
 +case MSR_IA32_BBL_CR_CTL3:
 +/* This legacy MSR exists but isn't fully documented in current
 + * silicon.  It is however accessed by winxp in very narrow
 + * scenarios where it sets bit #19, itself documented as
 + * a reserved bit.  Best effort attempt to source coherent
 + * read data here should the balance of the register be
 + * interpreted by the guest:
 + *
 + * L2 cache control register 3: 64GB range, 256KB size,
 + * enabled, latency 0x1, configured
 + */ 
 +data = 0xbe702111;
 +break;
  default:
  if (msr  (msr == vcpu-kvm-arch.xen_hvm_config.msr))
  return xen_hvm_config(vcpu, data);
 

 This is the MSR write path ?
   
The write path ignores the data. The MSR read returns a
mock-up of BBL_CR_CTL3 data. The above addresses the
narrow usage leading to the immediate bug. From the
feedback thus far from Intel I believe the above is sufficient
and minimal.

-john



-- 
john.coo...@third-harmonic.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Handle guest access to BBL_CR_CTL3 MSR

2011-01-07 Thread john cooper
A correction to Intel cpu model CPUID data (patch queued)
caused winxp-64 to BSOD when booted with a Penryn model.
This was traced to the CPUID model field correction from
6 - 23 (as is proper for a Penryn class of cpu).  Only in
this case does the problem surface.

The cause for this failure is winxp accessing the BBL_CR_CTL3
MSR which is unsupported by current kvm, appears to be a
legacy MSR not fully characterized yet existing in current
silicon, and is apparently carried forward in MSR space to
accommodate vintage code as here.  It is not yet conclusive
whether this MSR implements any of its legacy functionality
or is just an ornamental dud for compatibility.  While I
found no silicon version specific documentation link to
this MSR, a general description exists in Intel's developer's
reference which agrees with the functional behavior of
other bootloader/kernel code I've examined accessing
BBL_CR_CTL3.  Regrettably winxp-64 appears to be setting bit #19
called out as reserved in the above document.

So to minimally accommodate this MSR, kvm msr get will provide
the equivalent mock data and kvm msr write will simply toss the
guest passed data without interpretation.  While this treatment
of BBL_CR_CTL3 addresses the immediate problem, the approach may
be modified pending clarification from Intel.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 6b89f5e..145cd60 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -38,6 +38,7 @@
 
 #define MSR_MTRRcap0x00fe
 #define MSR_IA32_BBL_CR_CTL0x0119
+#define MSR_IA32_BBL_CR_CTL3   0x011e
 
 #define MSR_IA32_SYSENTER_CS   0x0174
 #define MSR_IA32_SYSENTER_ESP  0x0175
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa708c9..9a8331c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1283,6 +1283,12 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
return -1;
vcpu-arch.mcg_ctl = data;
break;
+   case MSR_IA32_BBL_CR_CTL3:
+   /* Drop writes to this legacy MSR -- see rdmsr
+* counterpart for further detail.
+*/
+   pr_unimpl(vcpu, ignored wrmsr: 0x%x data %llx\n, msr, data);
+   break;
default:
if (msr = MSR_IA32_MC0_CTL 
msr  MSR_IA32_MC0_CTL + 4 * bank_num) {
@@ -1592,6 +1598,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
} else
return set_msr_hyperv(vcpu, msr, data);
break;
+   case MSR_IA32_BBL_CR_CTL3:
+   /* This legacy MSR exists but isn't fully documented in current
+* silicon.  It is however accessed by winxp in very narrow
+* scenarios where it sets bit #19, itself documented as
+* a reserved bit.  Best effort attempt to source coherent
+* read data here should the balance of the register be
+* interpreted by the guest:
+*
+* L2 cache control register 3: 64GB range, 256KB size,
+* enabled, latency 0x1, configured
+*/ 
+   data = 0xbe702111;
+   break;
default:
if (msr  (msr == vcpu-kvm-arch.xen_hvm_config.msr))
return xen_hvm_config(vcpu, data);


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Looking at using KVM for embedded product

2010-06-23 Thread john cooper
Tom Shoes wrote:
 Hi there,
 
 I am looking at using KVM for an embedded product. I am also new
 to Virtualization so pardon if
 I ask dumb questions. This is my first time posting to this forum.
 
The embedded product that need to run KVM has:
 
  a. Intel processor with VT
  b. BIOS supports enabling VT
  c. Linux kernel 2.6.26 (from kernel.org)
  d. No VGA adapter
  e. Serial console
  f. BusyBox

Busybox is an interesting requirement in that context.  If
you are constrained with userland size and linking against
other than glibc, use of qemu could be interesting.  Can't
say I've built it other than linked against glibc and an
extensive list of runtime libraries.  Although I've never
tried to configure-down that dependency. 

-john

-- 
john.coo...@third-harmonic.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Add 'serial' attribute to virtio-blk devices

2010-06-21 Thread john cooper
Rusty Russell wrote:
 On Sat, 19 Jun 2010 04:08:02 am Ryan Harper wrote:
 Create a new attribute for virtio-blk devices that will fetch the serial 
 number
 of the block device.  This attribute can be used by udev to create disk/by-id
 symlinks for devices that don't have a UUID (filesystem) associated with 
 them.

 ATA_IDENTIFY strings are special in that they can be up to 20 chars long
 and aren't required to be NULL-terminated.  The buffer is also zero-padded
 meaning that if the serial is 19 chars or less that we get a NULL terminated
 string.  When copying this value into a string buffer, we must be careful to
 copy up to the NULL (if it present) and only 20 if it is longer and not to
 attempt to NULL terminate; this isn't needed.

 Signed-off-by: Ryan Harper ry...@us.ibm.com
 Signed-off-by: john cooper john.coo...@redhat.com
 ---
  drivers/block/virtio_blk.c |   32 
  1 files changed, 32 insertions(+), 0 deletions(-)

 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index 258bc2a..f1ef26f 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -281,6 +281,31 @@ static int index_to_minor(int index)
  return index  PART_BITS;
  }
  
 +/* Copy serial number from *s to *d.  Copy operation terminates on either
 + * encountering a nul in *s or after n bytes have been copied, whichever
 + * occurs first.  *d is not forcibly nul terminated.  Return # of bytes 
 copied.
 + */
 +static inline int serial_sysfs(char *d, char *s, int n)
 +{
 +char *di = d;
 +
 +while (*s  n--)
 +*d++ = *s++;
 +return d - di;
 +}
 +
 +static ssize_t virtblk_serial_show(struct device *dev,
 +struct device_attribute *attr, char *buf)
 +{
 +struct gendisk *disk = dev_to_disk(dev);
 +char id_str[VIRTIO_BLK_ID_BYTES];
 +
 +if (IS_ERR(virtblk_get_id(disk, id_str)))
 +return 0;
 
 0?  Really?  That doesn't seem very informative.

Propagating a prospective error from virtblk_get_id() should
be possible.  Unsure if doing so is more useful from the
user's perspective compared to just a nul id string.

 +return serial_sysfs(buf, id_str, min(VIRTIO_BLK_ID_BYTES, PAGE_SIZE));
 
 How about something like this:
 
   BUILD_BUG_ON(PAGE_SIZE  VIRTIO_BLK_ID_BYTES + 1);

Agreed, that's a better wrench in the gearworks.
Note padding buf[] by 1 isn't necessary as indicated
below.

   /* id_str is not necessarily nul-terminated! */
   buf[VIRTIO_BLK_ID_BYTES] = '\0';
   return virtblk_get_id(disk, buf);

The /sys file is rendered according to the length
returned from this function and the trailing nul
is not interpreted in this context.  In fact if a
nul is added and included in the byte count of the
string it will appear in the /sys file.

Thanks,

-john


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Add 'serial' attribute to virtio-blk devices

2010-06-21 Thread john cooper
Ryan Harper wrote:
 * john cooper john.coo...@redhat.com [2010-06-21 01:11]:
 Rusty Russell wrote:
 On Sat, 19 Jun 2010 04:08:02 am Ryan Harper wrote:
 Create a new attribute for virtio-blk devices that will fetch the serial 
 number
 of the block device.  This attribute can be used by udev to create 
 disk/by-id
 symlinks for devices that don't have a UUID (filesystem) associated with 
 them.

 ATA_IDENTIFY strings are special in that they can be up to 20 chars long
 and aren't required to be NULL-terminated.  The buffer is also zero-padded
 meaning that if the serial is 19 chars or less that we get a NULL 
 terminated
 string.  When copying this value into a string buffer, we must be careful 
 to
 copy up to the NULL (if it present) and only 20 if it is longer and not to
 attempt to NULL terminate; this isn't needed.

 Signed-off-by: Ryan Harper ry...@us.ibm.com
 Signed-off-by: john cooper john.coo...@redhat.com
 ---
  drivers/block/virtio_blk.c |   32 
  1 files changed, 32 insertions(+), 0 deletions(-)

 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index 258bc2a..f1ef26f 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -281,6 +281,31 @@ static int index_to_minor(int index)
return index  PART_BITS;
  }
  
 +/* Copy serial number from *s to *d.  Copy operation terminates on either
 + * encountering a nul in *s or after n bytes have been copied, whichever
 + * occurs first.  *d is not forcibly nul terminated.  Return # of bytes 
 copied.
 + */
 +static inline int serial_sysfs(char *d, char *s, int n)
 +{
 +  char *di = d;
 +
 +  while (*s  n--)
 +  *d++ = *s++;
 +  return d - di;
 +}
 +
 +static ssize_t virtblk_serial_show(struct device *dev,
 +  struct device_attribute *attr, char *buf)
 +{
 +  struct gendisk *disk = dev_to_disk(dev);
 +  char id_str[VIRTIO_BLK_ID_BYTES];
 +
 +  if (IS_ERR(virtblk_get_id(disk, id_str)))
 +  return 0;
 0?  Really?  That doesn't seem very informative.
 Propagating a prospective error from virtblk_get_id() should
 be possible.  Unsure if doing so is more useful from the
 user's perspective compared to just a nul id string.
 
 I'm not sure we can do any thing else here; maybe printk a warning?
 
 Documentation/filesystems/sysfs.txt says that showing attributes should
 always return the number of chars put into the buffer; so when there is
 an error; zero is the right value to return since we're not filling the
 buffer.

So we return a nul string in the case the qemu user
didn't specify an id string and also in the case a
legacy qemu doesn't support retrieval of an id string.
Not too much difference and if needed going forward the
error return can be elaborated.

 /* id_str is not necessarily nul-terminated! */
 buf[VIRTIO_BLK_ID_BYTES] = '\0';
 return virtblk_get_id(disk, buf);
 The /sys file is rendered according to the length
 returned from this function and the trailing nul
 is not interpreted in this context.  In fact if a
 nul is added and included in the byte count of the
 string it will appear in the /sys file.
 
 Yeah; I like the simplicity; but we do need to know how long the string
 is so we can return that value. 

Which we're getting from serial_sysfs() without
having to accommodate an unused nul.  I'd hazard the
primary reason the sysfs calling code keys off a
return of byte count vs. traversing the string itself
is due to the called function almost always having the
byte count available.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Add 'serial' attribute to virtio-blk devices

2010-06-21 Thread john cooper
Rusty Russell wrote:
 On Tue, 22 Jun 2010 02:13:21 am Ryan Harper wrote:
 * john cooper john.coo...@redhat.com [2010-06-21 01:11]:
 Rusty Russell wrote:
/* id_str is not necessarily nul-terminated! */
buf[VIRTIO_BLK_ID_BYTES] = '\0';
return virtblk_get_id(disk, buf);
 The /sys file is rendered according to the length
 returned from this function and the trailing nul
 is not interpreted in this context.  In fact if a
 nul is added and included in the byte count of the
 string it will appear in the /sys file.
 Yeah; I like the simplicity; but we do need to know how long the string
 is so we can return that value. 
 
 So we're looking at something like:
 
   /* id_str is not necessarily nul-terminated! */
   buf[VIRTIO_BLK_ID_BYTES] = '\0';
   err = virtblk_get_id(disk, buf);
   if (!err)
   return strlen(buf);
   if (err == -EIO) /* Unsupported?  Make it empty. */ 
   return 0;
   return err;

In my haste reading your prior mail, I'd glossed over
the fact you were copying direct to the sysfs buf.  So
in retrospect that (and the above) do make sense.

Thanks,

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Remove virtio_blk VBID ioctl

2010-06-20 Thread john cooper
Rusty Russell wrote:
 On Sat, 19 Jun 2010 04:08:03 am Ryan Harper wrote:
 With the availablility of a sysfs device attribute for examining disk serial
 numbers the ioctl is no longer needed.  The user-space changes for this 
 aren't
 upstream yet so we don't have any users to worry about.
 
 If John Cooper acks this, I'll push it to Linus immediately.

Actually I'm the one who suggested removing it.
The code in question was only intended as example
usage of accessing the s/n data in the driver, for
the /sys interface under discussion back then.
That effort subsequently stalled and Ryan had
recently picked it up.  As such I believe this
overshadows the general need for an ioctl.  Even if
for some reason an ioctl would be justified going
forward, a more usage friendly form would be better.
So let's just drop it for now as the corresponding
qemu-side code hasn't been merged.
 
 Unfortunately we offered this interface in 2.6.34, and we're now removing it.
 That's unpleasant.

Indeed.  This entire effort, aside from being an exercise
in protracted agony, probably violates a Rube Goldberg
patent.

Thanks,

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread john cooper
Avi Kivity wrote:
 On 06/01/2010 07:38 PM, Andi Kleen wrote:
 Your new code would starve again, right?


 Yes, of course it may starve with unfair spinlock. Since vcpus are not
 always running there is much smaller chance then vcpu on remote memory
 node will starve forever. Old kernels with unfair spinlocks are running
 fine in VMs on NUMA machines with various loads.
  
 Try it on a NUMA system with unfair memory.

 
 We are running everything on NUMA (since all modern machines are now
 NUMA).  At what scale do the issues become observable?
 
 I understand that reason and do not propose to get back to old spinlock
 on physical HW! But with virtualization performance hit is unbearable.
  
 Extreme unfairness can be unbearable too.

 
 Well, the question is what happens first.  In our experience, vcpu
 overcommit is a lot more painful.  People will never see the NUMA
 unfairness issue if they can't use kvm due to the vcpu overcommit problem.

Gleb's observed performance hit seems to be a rather mild
throughput depression compared with creating a worst case by
enforcing vcpu overcommit.  Running a single guest with 2:1
overcommit on a 4 core machine I saw over an order of magnitude
slowdown vs. 1:1 commit with the same kernel build test.
Others have reported similar results.

How close you'll get to that scenario depends on host
scheduling dynamics, and statistically the number of opened
and stalled lock held paths waiting to be contended.  So
I'd expect to see quite variable numbers for guest-guest
aggravation of this problem.

 What I'd like to see eventually is a short-term-unfair, long-term-fair
 spinlock.  Might make sense for bare metal as well.  But it won't be
 easy to write.

Collecting the contention/usage statistics on a per spinlock
basis seems complex.  I believe a practical approximation
to this are adaptive mutexes where upon hitting a spin
time threshold, punt and let the scheduler reconcile fairness.

-john

-- 
john.coo...@third-harmonic.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] Add cpu model configuration support.. (resend)

2010-02-13 Thread john cooper
Anthony Liguori wrote:
 On 02/01/2010 01:02 PM, john cooper wrote:
 [target-x86_64.conf was unintentionally omitted from the earlier patch]

 This is a reimplementation of prior versions which adds
 the ability to define cpu models for contemporary processors.
 The added models are likewise selected via -cpuname,
 and are intended to displace the existing convention
 of -cpu qemu64 augmented with a series of feature flags

 
 This breaks the arm-softmmu build.

Ugh, does indeed as well as a few other builds.
Updated patch follows.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Add cpu model configuration support..

2010-02-13 Thread john cooper
[Revision to fix build breakage for a few targets.  This
does not yet reflect Andre's suggestion to coalesce all
config file flags into one space, the implementation of
which depends somewhat upon acceptance of the proposed
config file syntax modification and is left as a TBD for
now.]


This is a reimplementation of prior versions which adds
the ability to define cpu models for contemporary processors.
The added models are likewise selected via -cpu name,
and are intended to displace the existing convention
of -cpu qemu64 augmented with a series of feature flags.

A primary motivation was determination of a least common
denominator within a given processor class to simplify guest
migration.  It is still possible to modify an arbitrary model
via additional feature flags however the goal here was to
make doing so unnecessary in typical usage.  The other
consideration was providing models names reflective of
current processors.  Both AMD and Intel have reviewed the
models in terms of balancing generality of migration vs.
excessive feature downgrade relative to released silicon. 

This version of the patch replaces the prior hard wired
definitions with a configuration file approach for new
models.  Existing models are thus far left as-is but may
easily be transitioned to (or may be overridden by) the
configuration file representation.

Proposed new model definitions are provided here for current
AMD and Intel processors.  Each model consists of a name
used to select it on the command line (-cpu name), and a
model_id which corresponds to a least common denominator
commercial instance of the processor class.

A table of names/model_ids may be queried via -cpu ?model:

:
x86   Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)  
x86   Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)  
x86   Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)   
x86  Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)   
x86   Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86   Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)
: 

Also added is -cpu ?dump which exhaustively outputs all config
data for all defined models, and -cpu ?cpuid which enumerates
all qemu recognized CPUID feature flags.

The pseudo cpuid flag 'check' when added to the feature flag list
will warn when feature flags (either implicit in a cpu model or
explicit on the command line) would have otherwise been quietly
unavailable to a guest:

# qemu-system-x86_64 ... -cpu Nehalem,check
warning: host cpuid _0001 lacks requested flag 'sse4.2|sse4_2' 
[0x0010]
warning: host cpuid _0001 lacks requested flag 'popcnt' [0x0080]

A similar 'enforce' pseudo flag exists which in addition
to the above causes qemu to error exit if requested flags are
unavailable.

Configuration data for a cpu model resides in the target config
file which by default will be installed as:

/usr/local/etc/qemu/target-arch.conf

The format of this file should be self explanatory given the
definitions for the above six models and essentially mimics
the structure of the static x86_def_t x86_defs.

Encoding of cpuid flags names now allows aliases for both the
configuration file and the command line which reconciles some
Intel/AMD/Linux/Qemu naming differences.

This patch was tested relative to qemu.git.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/Makefile b/Makefile
index c72a059..c6629d6 100644
--- a/Makefile
+++ b/Makefile
@@ -191,7 +191,11 @@ ifdef CONFIG_POSIX
$(INSTALL_DATA) qemu-nbd.8 $(DESTDIR)$(mandir)/man8
 endif
 
-install: all $(if $(BUILD_DOCS),install-doc)
+install-sysconfig:
+   $(INSTALL_DIR) $(sysconfdir)/qemu
+   $(INSTALL_DATA) sysconfigs/target/target-x86_64.conf 
$(sysconfdir)/qemu
+
+install: all $(if $(BUILD_DOCS),install-doc) install-sysconfig
$(INSTALL_DIR) $(DESTDIR)$(bindir)
 ifneq ($(TOOLS),)
$(INSTALL_PROG) $(STRIP_OPT) $(TOOLS) $(DESTDIR)$(bindir)
diff --git a/linux-user/main.c b/linux-user/main.c
index a0d8ce7..1189dda 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -2553,6 +2553,10 @@ int main(int argc, char **argv, char **envp)
 }
 
 cpu_model = NULL;
+#if defined(cpudef_setup)
+cpudef_setup(); /* parse cpu definitions in target config file (TBD) */
+#endif
+
 optind = 1;
 for(;;) {
 if (optind = argc)
@@ -2624,8 +2628,8 @@ int main(int argc, char **argv, char **envp)
 cpu_model = argv[optind++];
 if (cpu_model == NULL || strcmp(cpu_model, ?) == 0) {
 /* XXX: implement xxx_cpu_list for targets that still miss it */
-#if defined(cpu_list)
-cpu_list(stdout, fprintf);
+#if defined(cpu_list_id)
+cpu_list_id(stdout, fprintf, );
 #endif
 exit(1);
 }
diff --git a/qemu-config.c b/qemu-config.c
index c3203c8..246fae6 100644
--- a/qemu

Re: [Qemu-devel] [PATCH] Add assignment operation to config file parser..

2010-02-08 Thread john cooper
Gerd Hoffmann wrote:
 On 02/07/10 17:24, Anthony Liguori wrote:
 On 02/06/2010 12:59 PM, john cooper wrote:
 This patch reworks support for both assignment and
 append in the config file parser. It was motivated
 by comments received on the cpu model config file
 format.

 Commit dc9ca4ba27be4fe6a0284061b8f056c4364fb0d9
 changed the behavior of = from assign to append.
 This patch preserves the ability to append to an
 option (however now via +=), reverts = to its
 previous behavior, and allows both to interoperate.

 Signed-off-by: john cooperjohn.coo...@redhat.com

 This deviates from standard ini syntax which makes me a big
 uncomfortable with it. Gerd, do you have an opinion?
 
 Also it the syntax change will break existing users of the append
 feature (host/guestfwd for slirp networking).
 
 Any reason why you can't use the current append to avoid the overlong
 feature flag lines?

Impacting existing usage was my primary concern as well.
I'd run this by Mark who created the patch changing the
disposition of = to append and I didn't find any alarms
there.

The proposed scheme supports both semantics and arguably
seems more intuitive.  If that is the general consensus
and it won't unduly perturb existing usage the above
would be opportune before the current behavior becomes
more entrenched (e.g. by further dependencies such as the
proposed cpu model configuration).

But to answer the question, my prior patch can work with
either.

Thanks,

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Add assignment operation to config file parser..

2010-02-06 Thread john cooper
This patch reworks support for both assignment and
append in the config file parser.  It was motivated
by comments received on the cpu model config file
format.

Commit dc9ca4ba27be4fe6a0284061b8f056c4364fb0d9
changed the behavior of = from assign to append.
This patch preserves the ability to append to an
option (however now via +=), reverts = to its
previous behavior, and allows both to interoperate.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/qemu-config.c b/qemu-config.c
index 246fae6..4e53250 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -429,6 +429,7 @@ int qemu_config_parse(FILE *fp)
 char line[1024], group[64], id[64], arg[64], value[1024];
 QemuOptsList *list = NULL;
 QemuOpts *opts = NULL;
+char append;
 
 while (fgets(line, sizeof(line), fp) != NULL) {
 if (line[0] == '\n') {
@@ -455,13 +456,16 @@ int qemu_config_parse(FILE *fp)
 opts = qemu_opts_create(list, NULL, 0);
 continue;
 }
-if (sscanf(line,  %63s = \%1023[^\]\, arg, value) == 2) {
-/* arg = value */
+append = 0;
+if (sscanf(line,  %63s = \%1023[^\]\, arg, value) == 2 ||
+(sscanf(line,  %63s += \%1023[^\]\, arg, value) == 2 ?
+append = 1 : 0)) {
+/* arg = value, arg += value */
 if (opts == NULL) {
 fprintf(stderr, no group defined\n);
 return -1;
 }
-if (qemu_opt_set(opts, arg, value) != 0) {
+if (_qemu_opt_set(opts, arg, value, append) != 0) {
 fprintf(stderr, failed to set \%s\ for %s\n,
 arg, group);
 return -1;
diff --git a/qemu-option.c b/qemu-option.c
index a52a4c4..7c0faed 100644
--- a/qemu-option.c
+++ b/qemu-option.c
@@ -562,7 +562,11 @@ static void qemu_opt_del(QemuOpt *opt)
 qemu_free(opt);
 }
 
-int qemu_opt_set(QemuOpts *opts, const char *name, const char *value)
+/* add option *name,*value to group *opts.  if append add to tail of option
+ * list, else set as sole element (overwrite any existing entries of *name).
+ */
+int _qemu_opt_set(QemuOpts *opts, const char *name, const char *value,
+  char append)
 {
 QemuOpt *opt;
 QemuOptDesc *desc = opts-list-desc;
@@ -582,13 +586,27 @@ int qemu_opt_set(QemuOpts *opts, const char *name, const 
char *value)
 return -1;
 }
 }
-
-opt = qemu_mallocz(sizeof(*opt));
-opt-name = qemu_strdup(name);
-opt-opts = opts;
-QTAILQ_INSERT_TAIL(opts-head, opt, next);
-if (desc[i].name != NULL) {
-opt-desc = desc+i;
+if (append || !(opt = qemu_opt_find(opts, name))) {
+opt = qemu_mallocz(sizeof(*opt));
+opt-name = qemu_strdup(name);
+opt-opts = opts;
+QTAILQ_INSERT_TAIL(opts-head, opt, next);
+if (desc[i].name != NULL) {
+opt-desc = desc+i;
+}
+} else if (!append) {
+QemuOpt *p, *next;
+
+/* assign to reclaimed *opt, remove all other *name defs */
+QTAILQ_FOREACH_SAFE(p, opts-head, next, next) {
+if (p != opt  !strcmp(name, p-name)) {
+qemu_free((char *)p-str);
+QTAILQ_REMOVE(opts-head, p, next);
+qemu_free((char *)p);
+}
+}
+qemu_free((char *)opt-str);
+opt-str = NULL;
 }
 if (value) {
 opt-str = qemu_strdup(value);
diff --git a/qemu-option.h b/qemu-option.h
index 666b666..2385b1a 100644
--- a/qemu-option.h
+++ b/qemu-option.h
@@ -104,7 +104,14 @@ const char *qemu_opt_get(QemuOpts *opts, const char *name);
 int qemu_opt_get_bool(QemuOpts *opts, const char *name, int defval);
 uint64_t qemu_opt_get_number(QemuOpts *opts, const char *name, uint64_t 
defval);
 uint64_t qemu_opt_get_size(QemuOpts *opts, const char *name, uint64_t defval);
-int qemu_opt_set(QemuOpts *opts, const char *name, const char *value);
+int _qemu_opt_set(QemuOpts *opts, const char *name, const char *value,
+  char append);
+static inline int qemu_opt_set(QemuOpts *opts, const char *name,
+   const char *value)
+{
+return (_qemu_opt_set(opts, name, value, 0));
+}
+
 typedef int (*qemu_opt_loopfunc)(const char *name, const char *value, void 
*opaque);
 int qemu_opt_foreach(QemuOpts *opts, qemu_opt_loopfunc func, void *opaque,
  int abort_on_failure);
-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Add cpu model configuration support.. (resend)

2010-02-02 Thread john cooper
Andre Przywara wrote:

 +[cpudef]
 +   name = Conroe
 +   level = 2
 +   vendor = GenuineIntel
 +   family = 6
 +   model = 2
 +   stepping = 3
 +   feature_edx = sse2 sse fxsr mmx pat cmov pge sep apic cx8 mce pae
 msr tsc pse de fpumtrr clflush mca pse36
 +   feature_ecx = sse3 ssse3
 +   extfeature_edx = fxsr mmx pat cmov pge apic cx8 mce pae msr tsc
 pse de fpulm syscall nx
 +   extfeature_ecx = lahf_lm
 Wouldn't it be much more user friendly to merge them all into one
 string? Just from the feature names it is quite obscure to guess which
 flag belongs into which string (especially since we lack the EXTn_
 prefix we had in helper.c). I haven't tried it, but the parsing code
 looks like this shouldn't be too hard.
 To avoid overlong lines one could think about a += operator.

That's true.  Although I expect setup of a cpu model to
be a rather infrequent occurrence by the expert (+/-)
user so the above didn't strike me as a significant issue.
Also -cpu ?cpuid dumps out the entire motley crew of
flags relative to each grouping for reference.

That said the current config file syntax seems rather
rigid and I think your suggestion makes sense.  I avoided
modifying the parser at this point just in the interest of
minimizing the sprawl of this patch.

 I would just drop all definitions here except qemu{32,64} and
 kvm{32,64}. The other models should be described in the config file.

That's the goal but I wanted to leave an interim firewall
of sorts.  If the target-x86_64.conf isn't installed for
whatever reason, qemu still can fall back to the internal
definitions.  Even here it isn't strictly necessary to
remove an internal def as it can be redefined in the
config file which will override the internal version.
In general -cpu ?model will indicate internal vs.
externally defined models by enclosing internal model names
in brackets:

:
x86   Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)
:
x86 [athlon]  QEMU Virtual CPU version 0.12.50
:

It also seems worth dropping a hint to the user in the case qemu
fails to find a target config file rather than leaving them to
puzzle out why an external model has gone missing.

Thanks for the feedback.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Add cpu model configuration support.. (resend)

2010-02-01 Thread john cooper
[target-x86_64.conf was unintentionally omitted from the earlier patch]

This is a reimplementation of prior versions which adds
the ability to define cpu models for contemporary processors.
The added models are likewise selected via -cpu name,
and are intended to displace the existing convention
of -cpu qemu64 augmented with a series of feature flags.

A primary motivation was determination of a least common
denominator within a given processor class to simplify guest
migration.  It is still possible to modify an arbitrary model
via additional feature flags however the goal here was to
make doing so unnecessary in typical usage.  The other
consideration was providing models names reflective of
current processors.  Both AMD and Intel have reviewed the
models in terms of balancing generality of migration vs.
excessive feature downgrade relative to released silicon. 

This version of the patch replaces the prior hard wired
definitions with a configuration file approach for new
models.  Existing models are thus far left as-is but may
easily be transitioned to (or may be overridden by) the
configuration file representation.

Proposed new model definitions are provided here for current
AMD and Intel processors.  Each model consists of a name
used to select it on the command line (-cpu name), and a
model_id which corresponds to a least common denominator
commercial instance of the processor class.

A table of names/model_ids may be queried via -cpu ?model:

:
x86   Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)  
x86   Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)  
x86   Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)   
x86  Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)   
x86   Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86   Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)
: 

Also added is -cpu ?dump which exhaustively outputs all config
data for all defined models, and -cpu ?cpuid which enumerates
all qemu recognized CPUID feature flags.

The pseudo cpuid flag 'check' when added to the feature flag list
will warn when feature flags (either implicit in a cpu model or
explicit on the command line) would have otherwise been quietly
unavailable to a guest:

# qemu-system-x86_64 ... -cpu Nehalem,check
warning: host cpuid _0001 lacks requested flag 'sse4.2|sse4_2' 
[0x0010]
warning: host cpuid _0001 lacks requested flag 'popcnt' [0x0080]

A similar 'enforce' pseudo flag exists which in addition
to the above causes qemu to error exit if requested flags are
unavailable.

Configuration data for a cpu model resides in the target config
file which by default will be installed as:

/usr/local/etc/qemu/target-arch.conf

The format of this file should be self explanatory given the
definitions for the above six models and essentially mimics
the structure of the static x86_def_t x86_defs.

Encoding of cpuid flags names now allows aliases for both the
configuration file and the command line which reconciles some
Intel/AMD/Linux/Qemu naming differences.

This patch was tested relative to qemu.git.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/Makefile b/Makefile
index 3848627..b7fa6ef 100644
--- a/Makefile
+++ b/Makefile
@@ -191,7 +191,11 @@ ifdef CONFIG_POSIX
$(INSTALL_DATA) qemu-nbd.8 $(DESTDIR)$(mandir)/man8
 endif
 
-install: all $(if $(BUILD_DOCS),install-doc)
+install-sysconfig:
+   $(INSTALL_DIR) $(sysconfdir)/qemu
+   $(INSTALL_DATA) sysconfigs/target/target-x86_64.conf 
$(sysconfdir)/qemu
+
+install: all $(if $(BUILD_DOCS),install-doc) install-sysconfig
$(INSTALL_DIR) $(DESTDIR)$(bindir)
 ifneq ($(TOOLS),)
$(INSTALL_PROG) $(STRIP_OPT) $(TOOLS) $(DESTDIR)$(bindir)
diff --git a/qemu-config.c b/qemu-config.c
index c3203c8..246fae6 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -242,6 +242,54 @@ QemuOptsList qemu_mon_opts = {
 },
 };
 
+QemuOptsList qemu_cpudef_opts = {
+.name = cpudef,
+.head = QTAILQ_HEAD_INITIALIZER(qemu_cpudef_opts.head),
+.desc = {
+{
+.name = name,
+.type = QEMU_OPT_STRING,
+},{
+.name = level,
+.type = QEMU_OPT_NUMBER,
+},{
+.name = vendor,
+.type = QEMU_OPT_STRING,
+},{
+.name = family,
+.type = QEMU_OPT_NUMBER,
+},{
+.name = model,
+.type = QEMU_OPT_NUMBER,
+},{
+.name = stepping,
+.type = QEMU_OPT_NUMBER,
+},{
+.name = feature_edx,  /* cpuid _0001.edx */
+.type = QEMU_OPT_STRING,
+},{
+.name = feature_ecx,  /* cpuid _0001.ecx */
+.type = QEMU_OPT_STRING,
+},{
+.name = extfeature_edx,   /* cpuid 8000_0001.edx */
+.type = QEMU_OPT_STRING

[PATCH] Add cpu model configuration support..

2010-01-31 Thread john cooper
This is a reimplementation of prior versions which adds
the ability to define cpu models for contemporary processors.
The added models are likewise selected via -cpu name,
and are intended to displace the existing convention
of -cpu qemu64 augmented with a series of feature flags.

A primary motivation was determination of a least common
denominator within a given processor class to simplify guest
migration.  It is still possible to modify an arbitrary model
via additional feature flags however the goal here was to
make doing so unnecessary in typical usage.  The other
consideration was providing models names reflective of
current processors.  Both AMD and Intel have reviewed the
models in terms of balancing generality of migration vs.
excessive feature downgrade relative to released silicon. 

This version of the patch replaces the prior hard wired
definitions with a configuration file approach for new
models.  Existing models are thus far left as-is but may
easily be transitioned to (or may be overridden by) the
configuration file representation.

Proposed new model definitions are provided here for current
AMD and Intel processors.  Each model consists of a name
used to select it on the command line (-cpu name), and a
model_id which corresponds to a least common denominator
commercial instance of the processor class.

A table of names/model_ids may be queried via -cpu ?model:

:
x86   Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)  
x86   Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)  
x86   Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)   
x86  Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)   
x86   Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86   Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)
: 

Also added is -cpu ?dump which exhaustively outputs all config
data for all defined models, and -cpu ?cpuid which enumerates
all qemu recognized CPUID feature flags.

The pseudo cpuid flag 'check' when added to the feature flag list
will warn when feature flags (either implicit in a cpu model or
explicit on the command line) would have otherwise been quietly
unavailable to a guest:

# qemu-system-x86_64 ... -cpu Nehalem,check
warning: host cpuid _0001 lacks requested flag 'sse4.2|sse4_2' 
[0x0010]
warning: host cpuid _0001 lacks requested flag 'popcnt' [0x0080]

A similar 'enforce' pseudo flag exists which in addition
to the above causes qemu to error exit if requested flags are
unavailable.

Configuration data for a cpu model resides in the target config
file which by default will be installed as:

/usr/local/etc/qemu/target-arch.conf

The format of this file should be self explanatory given the
definitions for the above six models and essentially mimics
the structure of the static x86_def_t x86_defs.

Encoding of cpuid flags names now allows aliases for both the
configuration file and the command line which reconciles some
Intel/AMD/Linux/Qemu naming differences.

This patch was tested relative to qemu.git.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/Makefile b/Makefile
index 3848627..b7fa6ef 100644
--- a/Makefile
+++ b/Makefile
@@ -191,7 +191,11 @@ ifdef CONFIG_POSIX
$(INSTALL_DATA) qemu-nbd.8 $(DESTDIR)$(mandir)/man8
 endif
 
-install: all $(if $(BUILD_DOCS),install-doc)
+install-sysconfig:
+   $(INSTALL_DIR) $(sysconfdir)/qemu
+   $(INSTALL_DATA) sysconfigs/target/target-x86_64.conf 
$(sysconfdir)/qemu
+
+install: all $(if $(BUILD_DOCS),install-doc) install-sysconfig
$(INSTALL_DIR) $(DESTDIR)$(bindir)
 ifneq ($(TOOLS),)
$(INSTALL_PROG) $(STRIP_OPT) $(TOOLS) $(DESTDIR)$(bindir)
diff --git a/qemu-config.c b/qemu-config.c
index c3203c8..246fae6 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -242,6 +242,54 @@ QemuOptsList qemu_mon_opts = {
 },
 };
 
+QemuOptsList qemu_cpudef_opts = {
+.name = cpudef,
+.head = QTAILQ_HEAD_INITIALIZER(qemu_cpudef_opts.head),
+.desc = {
+{
+.name = name,
+.type = QEMU_OPT_STRING,
+},{
+.name = level,
+.type = QEMU_OPT_NUMBER,
+},{
+.name = vendor,
+.type = QEMU_OPT_STRING,
+},{
+.name = family,
+.type = QEMU_OPT_NUMBER,
+},{
+.name = model,
+.type = QEMU_OPT_NUMBER,
+},{
+.name = stepping,
+.type = QEMU_OPT_NUMBER,
+},{
+.name = feature_edx,  /* cpuid _0001.edx */
+.type = QEMU_OPT_STRING,
+},{
+.name = feature_ecx,  /* cpuid _0001.ecx */
+.type = QEMU_OPT_STRING,
+},{
+.name = extfeature_edx,   /* cpuid 8000_0001.edx */
+.type = QEMU_OPT_STRING,
+},{
+.name = extfeature_ecx,   /* cpuid 8000_0001.ecx

Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..

2010-01-21 Thread john cooper
Anthony Liguori wrote:
 On 01/20/2010 07:18 PM, john cooper wrote: 
 I can appreciate the concern of wanting to get this
 as correct as possible.

 
 This is the root of the trouble.  At the qemu layer, we try to focus on
 being correct.
 
 Management tools are typically the layer that deals with being correct.
 
 A good compromise is making things user tunable which means that a
 downstream can make correctness decisions without forcing those
 decisions on upstream.

Conceptually I agree with such a malleable approach -- actually
I prefer it.  I thought however it was too much infrastructure to
foist on the problem just to add a few more models into the mix.

The only reservation which comes to mind is that of logistics.
This may ruffle the code some and impact others such as Andre
who seem to have existing patches relative to the current structure.
Anyone have strong objections to this approach before I have a
look at an implementation?

Thanks,

-john


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..

2010-01-21 Thread john cooper
Jamie Lokier wrote:

 Do you mean that more powerful management tools to support safe
 migration will maintain _their own_ processor model tables, and
 perform their calculations using their own tables instead of querying
 qemu, and therefore not have any need of qemu's built in table?

I would expect so.  IIRC that is what the libvirt folks have
in mind for example.  But we're also trying to simplify the use
case of the lonesome user at one with the qemu CLI.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..

2010-01-21 Thread john cooper
Jamie Lokier wrote:

 I think we can all agree that there is no point looking for a familiar
 -cpu naming scheme because there aren't any familiar and meaningful names
 these days.

Even if we dismiss the Intel coined names as internal
code names, there is still VMW's use of them in this
space which we can either align with or attempt to
displace.   All considered I don't see any motivation
nor gain in doing the latter.  Anyway it doesn't appear
likely we're going to resolve this to our collective
satisfaction with a hard-wired naming scheme.   
 
 It would be nice if qemu could tell the user which of the built-in
 -cpu choices is the most featureful subset of their own host.  With
 -cpu host implemented, finding that is probably quite easy.

This should be doable although it may not be as simple
as traversing a hierarchy of features and picking one
with the most host flags present.  In any case this
should be fairly detachable from settling the immediate
issue.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..

2010-01-20 Thread john cooper
Jamie Lokier wrote:
 Anthony Liguori wrote:
 On 01/18/2010 10:45 AM, john cooper wrote:
 x86   Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)
 x86   Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
 x86  Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)
 x86   Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)
 x86   Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)
 x86   Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)
 I'm very much against having -cpu Nehalem.  The whole point of this is 
 to make things easier for a user and for most of the users I've 
 encountered, -cpu Nehalem is just as obscure as -cpu qemu64,-sse3,+vmx,...
 
 When I saw that table just now, I had no idea whether Nehalem is newer
 and more advanced than Penryn, or the other way around.  I also have
 no idea if Core i7 is newer than Core 2 Duo or not.

I can appreciate the argument above, however the goal was
choosing names with some basis in reality.  These were
recommended by our contacts within Intel, are used by VmWare
to describe their similar cpu models, and arguably have fallen
to defacto usage as evidenced by such sources as:

http://en.wikipedia.org/wiki/Conroe_(microprocessor)
http://en.wikipedia.org/wiki/Penryn_(microprocessor)
http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)

I suspect whatever we choose of reasonable length as a model
tag for -cpu some further detail is going to be required.
That was the motivation to augment the table as above with
an instance of a LCD for that associated class.
 
 I'm not a typical user: I know quite a lot about x86 architecture;
 I just haven't kept up to date enough to know the code/model names.
 Typical users will know less about them.

Understood.  One thought I had to further clarify what is going
on under the hood was to dump the cpuid flags for each model as
part of (or in addition to) the above table.  But this seems a
bit extreme and kvm itself can modify flags exported from qemu
to a guest.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..

2010-01-20 Thread john cooper
Anthony Liguori wrote:
 On 01/19/2010 02:03 PM, Chris Wright wrote:
 * Anthony Liguori (anth...@codemonkey.ws) wrote:
   
 I'm very much against having -cpu Nehalem.  The whole point of this is
 to make things easier for a user and for most of the users I've
 encountered, -cpu Nehalem is just as obscure as -cpu
 qemu64,-sse3,+vmx,...
  
 What name will these users know?  FWIW, it makes sense to me as it is.

 
 Whatever is in /proc/cpuinfo.

$ grep name /proc/cpuinfo
model name  : Intel(R) Core(TM)2 Duo CPU E8400  @ 3.00GHz

Which is detailing that exact cpu vs. the class
of which it is a member.  So are you suggesting
to map all instances of processors called out
in /proc/cpuinfo into one of the three defined
models?  We can certainly do that however I was
looking for a more terse and simplified solution
at this level while deferring more ornate mapping
schemes to management tools.

Still at the user facing CLI this doesn't strike
me as the most friendly encoding of a -cpu name.  

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..

2010-01-20 Thread john cooper
Jamie Lokier wrote:
 john cooper wrote:
 As before a cpu feature 'check' option is added which warns when
 feature flags (either implicit in a cpu model or explicit on the
 command line) would have otherwise been quietly unavailable to a
 guest:

 # qemu-system-x86_64 ... -cpu Nehalem,check
 warning: host cpuid _0001 lacks requested flag 'sse4.2' [0x0010]
 warning: host cpuid _0001 lacks requested flag 'popcnt' [0x0080]
 
 That's a nice feature.  Can we have a 'checkfail' option which refuses
 to run if a requested capability isn't available?  Thanks.

Certainly, others have requested the same.  Let's resolve
the issue at hand first.

 I foresee wanting to iterate over the models and pick the latest one
 which a host supports - on the grounds that you have done the hard
 work of ensuring it is a reasonably good performer, while probably
 working on another host of similar capability when a new host is made
 available.

That's a fairly close use case to that of safe migration
which was one of the primary motivations to identify
the models being discussed.  Although presentation and
administration of such was considered the domain of management
tools.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..

2010-01-20 Thread john cooper
Chris Wright wrote:
 * Daniel P. Berrange (berra...@redhat.com) wrote:
 To be honest all possible naming schemes for '-cpu name' are just as
 unfriendly as each other. The only user friendly option is '-cpu host'. 

 IMHO, we should just pick a concise naming scheme  document it. Given
 they are all equally unfriendly, the one that has consistency with vmware
 naming seems like a mild winner.
 
 Heh, I completely agree, and was just saying the same thing to John
 earlier today.  May as well be -cpu {foo,bar,baz} since the meaning for
 those command line options must be well-documented in the man page.

I can appreciate the concern of wanting to get this
as correct as possible.  But ultimately we just
need three unique tags which ideally have some relation
to their associated architectures.  The diatribes
available from /proc/cpuinfo while generally accurate
don't really offer any more of a clue to the model
group, and in their unmodified form are rather unwieldy
as command line flags.

 This is from an EVC kb article[1]:

Here is a pointer to a more detailed version:

   
http://kb.vmware.com/selfservice/microsites/search.do?language=en_UScmd=displayKCexternalId=1003212


We probably should also add an option to dump out the
full set of qemu-side cpuid flags for the benefit of
users and upper level tools.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Add definitions for current cpu models..

2010-01-18 Thread john cooper
This is a rework of the prior version which adds definitions
for contemporary processors selected via -cpu model, as an
alternative to the existing use of -cpu qemu64 augmented
with a series of feature flags.

The primary motivation was determination of a least common
denominator within a given processor class to simplify guest
migration.  It is still possible to modify an arbitrary model
via additional feature flags however the goal here was to
make doing so unnecessary in typical usage.  The other
consideration was providing models names reflective of
current processors.  Both AMD and Intel have reviewed the
models in terms of balancing generality of migration vs.
excessive feature downgrade relative to released silicon. 

Concerning the prior version of the patch, the proposed name
used for a given model drew a fair amount of debate, the
main concern being use of names as mnemonic as possible to
the wisest group of users.  Another suggestion was to use
the vendor name of released silicon corresponding to a least
common denominator CPU within the class, rational being doing
so is more definitive of the intended functionality.  However
something like:

 -cpu Intel Core 2 Duo P9xxx

probably isn't all that easy to remember nor type when
selecting a Penryn class cpu.  So I struck what I believe to
be a reasonable compromise where the original x86_def_t.name
was for the most part retained with the x86_def_t.model_id
capturing the marketing name of the cpu being used as the
least common denominator for the class.  To make it easier for
a user to associate a *.name with *.model_id, -cpu ? invoked
rather as -cpu ?? will append *.model_id to the generated
table:

:
x86   Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)   
x86   Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86  Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)   
x86   Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)   
x86   Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)  
x86   Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)
: 

As before a cpu feature 'check' option is added which warns when
feature flags (either implicit in a cpu model or explicit on the
command line) would have otherwise been quietly unavailable to a
guest:

# qemu-system-x86_64 ... -cpu Nehalem,check
warning: host cpuid _0001 lacks requested flag 'sse4.2' [0x0010]
warning: host cpuid _0001 lacks requested flag 'popcnt' [0x0080]

This patch was tested relative to qemu.git.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index f3834b3..58400ab 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -722,8 +722,8 @@ typedef struct CPUX86State {
 CPUX86State *cpu_x86_init(const char *cpu_model);
 int cpu_x86_exec(CPUX86State *s);
 void cpu_x86_close(CPUX86State *s);
-void x86_cpu_list (FILE *f, int (*cpu_fprintf)(FILE *f, const char *fmt,
- ...));
+void x86_cpu_list (FILE *f, int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+const char *optarg);
 int cpu_get_pic_interrupt(CPUX86State *s);
 /* MSDOS compatibility mode FPU exception support */
 void cpu_set_ferr(CPUX86State *s);
@@ -875,7 +875,7 @@ uint64_t cpu_get_tsc(CPUX86State *env);
 #define cpu_exec cpu_x86_exec
 #define cpu_gen_code cpu_x86_gen_code
 #define cpu_signal_handler cpu_x86_signal_handler
-#define cpu_list x86_cpu_list
+#define cpu_list_id x86_cpu_list
 
 #define CPU_SAVE_VERSION 11
 
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 730e396..34f4936 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -42,7 +42,7 @@ static const char *feature_name[] = {
 static const char *ext_feature_name[] = {
 pni /* Intel,AMD sse3 */, NULL, NULL, monitor, ds_cpl, vmx, NULL 
/* Linux smx */, est,
 tm2, ssse3, cid, NULL, NULL, cx16, xtpr, NULL,
-NULL, NULL, dca, NULL, NULL, NULL, NULL, popcnt,
+NULL, NULL, dca, sse4.1, sse4.2, x2apic, NULL, popcnt,
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, hypervisor,
 };
 static const char *ext2_feature_name[] = {
@@ -58,6 +58,18 @@ static const char *ext3_feature_name[] = {
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 };
 
+/* collects per-function cpuid data
+ */
+typedef struct model_features_t {
+uint32_t *guest_feat;
+uint32_t *host_feat;
+uint32_t check_feat;
+const char **flag_names;
+uint32_t cpuid;
+} model_features_t;
+
+int check_cpuid = 0;
+
 static void add_flagname_to_bitmaps(const char *flagname, uint32_t *features,
 uint32_t *ext_features,
 uint32_t *ext2_features,
@@ -111,10 +123,25 @@ typedef struct x86_def_t {
   CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | CPUID_PAT | \
   CPUID_PSE36 | CPUID_FXSR

Re: [PATCH] Add definitions for current cpu models..

2010-01-04 Thread john cooper
Marcelo Tosatti wrote:
 On Mon, Dec 21, 2009 at 01:46:36AM -0500, john cooper wrote:
 +{
 +.name = Opteron_G2,
 +.level = 5,
 +.vendor1 = CPUID_VENDOR_INTEL_1,
 +.vendor2 = CPUID_VENDOR_INTEL_2,
 +.vendor3 = CPUID_VENDOR_INTEL_3,
 
 Silly question: why a CPU named Opteron_G2 uses intel vendor id's?

The feedback I had from AMD indicated using the Intel
strings for a family 15 cpu resulted in the least
amount of guest confusion.  The upstream kvm64 model
does similarly.

Sorry for my late response here which was preempted
by the intervening holiday.

-john

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Add definitions for current cpu models..

2009-12-20 Thread john cooper
This adds definitions for contemporary processors
which may be selected via -cpu model, as an
alternative to the existing use of -cpu qemu64
augmented with a series of feature flags.

The primary motivation was determination of a
least common denominator within a given processor
class for simplification of guest migration.  It
is still possible to modify an arbitrary model via
additional feature flags however the goal here was
to make doing so unnecessary in typical usage.  The
other consideration was providing models names
reflective of current processors.  Both AMD and
Intel have reviewed the models in terms of balancing
generality of migration vs. excessive feature
downgrade relative to released silicon. 

A cpu feature 'check' option is also added which
warns when feature flags (either implicit in a cpu
model or explicit on the command line) would have
otherwise been quietly disabled for a guest.

This patch was tested relative to qemu-kvm.git.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/target-i386/helper.c b/target-i386/helper.c
index 9a50da6..a706cae 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -44,7 +44,7 @@ static const char *feature_name[] = {
 static const char *ext_feature_name[] = {
 pni /* Intel,AMD sse3 */, NULL, NULL, monitor, ds_cpl, vmx, NULL 
/* Linux smx */, est,
 tm2, ssse3, cid, NULL, NULL, cx16, xtpr, NULL,
-NULL, NULL, dca, NULL, NULL, x2apic, NULL, popcnt,
+NULL, NULL, dca, sse4.1, sse4.2, x2apic, NULL, popcnt,
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, hypervisor,
 };
 static const char *ext2_feature_name[] = {
@@ -60,6 +60,18 @@ static const char *ext3_feature_name[] = {
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 };
 
+/* collects per-function cpuid data
+ */
+typedef struct model_features_t {
+uint32_t *guest_feat;
+uint32_t *host_feat;
+uint32_t check_feat;
+const char **flag_names;
+uint32_t cpuid;
+} model_features_t;
+
+int check_cpuid = 0;
+
 static void add_flagname_to_bitmaps(const char *flagname, uint32_t *features,
 uint32_t *ext_features,
 uint32_t *ext2_features,
@@ -171,6 +183,139 @@ static x86_def_t x86_defs[] = {
 .model_id = AMD Phenom(tm) 9550 Quad-Core Processor
 },
 {
+.name = Merom,
+.level = 2,
+.vendor1 = CPUID_VENDOR_INTEL_1,
+.vendor2 = CPUID_VENDOR_INTEL_2,
+.vendor3 = CPUID_VENDOR_INTEL_3,
+.family = 6,   /* P6 */
+.model = 2,
+.stepping = 3,
+.features = PPRO_FEATURES | 
+/* these features are needed for Win64 and aren't fully implemented */
+CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
+/* this feature is needed for Solaris and isn't fully implemented */
+CPUID_PSE36,
+.ext_features = CPUID_EXT_SSE3,/* from qemu64 */
+.ext2_features = (PPRO_FEATURES  0x0183F3FF) | 
+CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
+.ext3_features = CPUID_EXT3_SVM,   /* from qemu64 */
+.xlevel = 0x800A,
+.model_id = Intel Merom Core 2,
+},
+{
+.name = Penryn,
+.level = 2,
+.vendor1 = CPUID_VENDOR_INTEL_1,
+.vendor2 = CPUID_VENDOR_INTEL_2,
+.vendor3 = CPUID_VENDOR_INTEL_3,
+.family = 6,   /* P6 */
+.model = 2,
+.stepping = 3,
+.features = PPRO_FEATURES | 
+/* these features are needed for Win64 and aren't fully implemented */
+CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
+/* this feature is needed for Solaris and isn't fully implemented */
+CPUID_PSE36,
+.ext_features = CPUID_EXT_SSE3 |
+CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41,
+.ext2_features = (PPRO_FEATURES  0x0183F3FF) | 
+CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
+.ext3_features = CPUID_EXT3_SVM,
+.xlevel = 0x800A,
+.model_id = Intel Penryn Core 2,
+},
+{
+.name = Nehalem,
+.level = 2,
+.vendor1 = CPUID_VENDOR_INTEL_1,
+.vendor2 = CPUID_VENDOR_INTEL_2,
+.vendor3 = CPUID_VENDOR_INTEL_3,
+.family = 6,   /* P6 */
+.model = 2,
+.stepping = 3,
+.features = PPRO_FEATURES | 
+/* these features are needed for Win64 and aren't fully implemented */
+CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
+/* this feature is needed for Solaris and isn't fully implemented */
+CPUID_PSE36,
+.ext_features = CPUID_EXT_SSE3 |
+CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41 |
+CPUID_EXT_SSE42 | CPUID_EXT_POPCNT,
+.ext2_features = (PPRO_FEATURES  0x0183F3FF) | 
+CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
+.ext3_features = CPUID_EXT3_SVM,
+.xlevel = 0x800A,
+.model_id

Re: virtio disk slower than IDE?

2009-11-16 Thread john cooper
[prior attempts from elsewhere kept bouncing, apologies for any replication]

Gordan Bobic wrote:
 The test is building the Linux kernel (only taking the second run to give the 
 test the benefit of local cache):

 make clean; make -j8 all; make clean; sync; time make -j8 all

 This takes about 10 minutes with IDE disk emulation and about 13 minutes with 
 virtio. I ran the tests multiple time with most non-essential services on the 
 host switched off (including cron/atd), and the guest in single-user mode to 
 reduce the noise in the test to the minimum, and the results are pretty 
 consistent, with virtio being about 30% behind.

I'd expect for an observed 30% wall clock time difference
of an operation as complex as a kernel build the base i/o
throughput disparity is substantially greater.  Did you
try a more simple/regular load, eg: a streaming dd read
of various block sizes from guest raw disk devices?
This is also considerably easier to debug vs. the complex
i/o load generated by a build.

One way to chop up the problem space is using blktrace
on the host to observe both the i/o patterns coming out
of qemu and the host's response to them in terms of
turn around time.  I expect you'll see somewhat different
nature requests generated by qemu w/r/t blocking and
number of threads serving virtio_blk requests relative
to ide but the host response should be essentially the
same in terms of data returned per unit time.

If the host looks to be turning around i/o request with
similar latency in both cases, the problem would be lower
frequency of requests generated by qemu in the case of
virtio_blk.   Here it would be useful to know the host
load generated by the guest for both cases.

-john


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio disk slower than IDE?

2009-11-16 Thread john cooper

Gordan Bobic wrote:
The test is building the Linux kernel (only taking the second run to 
give the test the benefit of local cache):


make clean; make -j8 all; make clean; sync; time make -j8 all

This takes about 10 minutes with IDE disk emulation and about 13 minutes 
with virtio. I ran the tests multiple time with most non-essential 
services on the host switched off (including cron/atd), and the guest in 
single-user mode to reduce the noise in the test to the minimum, and 
the results are pretty consistent, with virtio being about 30% behind.


I'd expect for an observed 30% wall clock time difference
of an operation as complex as a kernel build the base i/o
throughput disparity is substantially greater.  Did you
try a more simple/regular load, eg: a streaming dd read
of various block sizes from guest raw disk devices?
This is also considerably easier to debug vs. the complex
i/o load generated by a build.

One way to chop up the problem space is using blktrace
on the host to observe both the i/o patterns coming out
of qemu and the host's response to them in terms of
turn around time.  I expect you'll see somewhat different
nature requests generated by qemu w/r/t blocking and
number of threads serving virtio_blk requests relative
to ide but the host response should be essentially the
same in terms of data returned per unit time.

If the host looks to be turning around i/o request with
similar latency in both cases, the problem would be lower
frequency of requests generated by qemu in the case of
virtio_blk.   Here it would be useful to know the host
load generated by the guest for both cases.

-john


--
john.coo...@third-harmonic.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM-88 broke VirtIO Hard Disks

2009-07-30 Thread john cooper

Anthony Liguori wrote:

Marcelo Tosatti wrote:

bf011293f is an easy one to blame, can you revert it and check,
please?
  


Has anyone tracked down a proper fix?

Apologies, I'd been distracted elsewhere.

I suspect transferring the identify page in
its entirety via the config space is somehow
confusing the windows virtio driver although
it isn't immediately clear why.

Investigating it now.

-john

--
john.coo...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Add serial number support for virtio_blk, V4

2009-06-03 Thread john cooper
Jens Axboe wrote:
 On Mon, Jun 01 2009, Rusty Russell wrote:
 On Fri, 29 May 2009 01:45:27 pm john cooper wrote:
 virtio_blk-serial-4.patch
 Hate to ask dumb questions, but is there a scsi equivalent of this?  It'd be 
 nice if we could avoid being ATA-specific in the long run...
 
 SCSI has mode pages, where ATA pretty much stuffs everything into the
 identify data.
 
 Also, why u16?
 
 The identify page is word based, so u16 makes sense.

It is actually an artifact left over from the previous
version.  In that case the driver was only pulling
the serial number over, itself constructing a mocked-up
identify page, and needed to be aware of the internal
structure.

Currently qemu fabricates the identify page which the
driver treats as opaque data passed onto the user.  So
the u16 * structure has no meaning.

Sorry to be late in responding here.  A patch to address
this wrinkle and other feedback to follow shortly.

-john


 +/* return ATA identify data
 + */
 +static int virtblk_identify(struct gendisk *disk, void *argp)
 +{
 +   struct virtio_blk *vblk = disk-private_data;
 +   u16 *id;
 +   int err = -ENOMEM;
 +
 +   id = kmalloc(VIRTIO_BLK_ID_BYTES, GFP_KERNEL);
 +   if (!id)
 +   goto out;
 +
 +   err = virtio_config_buf(vblk-vdev, VIRTIO_BLK_F_IDENTIFY,
 +   offsetof(struct virtio_blk_config, identify), id,
 +   VIRTIO_BLK_ID_BYTES);
 


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Add serial number support for virtio_blk, V4a

2009-06-03 Thread john cooper
This patch extracts the opaque data from pci i/o
region 0 via the added VIRTIO_BLK_F_IDENTIFY
field.  By convention this data takes the form of
that returned by an ATA IDENTIFY DEVICE command,
however the driver (except for structure size)
makes no interpretation of the data.  The structure
data is copied wholesale to userspace via a
HDIO_GET_IDENTITY ioctl command (eg: hdparm -i dev).

Signed-off-by: john cooper john.coo...@redhat.com
---

 drivers/block/virtio_blk.c |   41 ++---
 include/linux/virtio_blk.h |6 ++
 2 files changed, 44 insertions(+), 3 deletions(-)
=
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -146,12 +146,46 @@ static void do_virtblk_request(struct re
vblk-vq-vq_ops-kick(vblk-vq);
 }
 
+/* return ATA identify data
+ */
+static int virtblk_identify(struct gendisk *disk, void *argp)
+{
+   struct virtio_blk *vblk = disk-private_data;
+   void *opaque;
+   int err = -ENOMEM;
+
+   opaque = kmalloc(VIRTIO_BLK_ID_BYTES, GFP_KERNEL);
+   if (!opaque)
+   goto out;
+
+   err = virtio_config_buf(vblk-vdev, VIRTIO_BLK_F_IDENTIFY,
+   offsetof(struct virtio_blk_config, identify), opaque,
+   VIRTIO_BLK_ID_BYTES);
+
+   if (err)
+   goto out_kfree;
+
+   if (copy_to_user(argp, opaque, VIRTIO_BLK_ID_BYTES))
+   err = -EFAULT;
+
+out_kfree:
+   kfree(opaque);
+out:
+   return err;
+}
+
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 unsigned cmd, unsigned long data)
 {
-   return scsi_cmd_ioctl(bdev-bd_disk-queue,
- bdev-bd_disk, mode, cmd,
- (void __user *)data);
+   struct gendisk *disk = bdev-bd_disk;
+   void __user *argp = (void __user *)data;
+
+   switch (cmd) {
+   case HDIO_GET_IDENTITY:
+   return virtblk_identify(disk, argp);
+   default:
+   return scsi_cmd_ioctl(disk-queue, disk, mode, cmd, argp);
+   }
 }
 
 /* We provide getgeo only to please some old bootloader/partitioning tools */
@@ -356,6 +390,7 @@ static struct virtio_device_id id_table[
 static unsigned int features[] = {
VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
+   VIRTIO_BLK_F_IDENTIFY
 };
 
 static struct virtio_driver virtio_blk = {
=
--- a/include/linux/virtio_blk.h
+++ b/include/linux/virtio_blk.h
@@ -15,7 +15,12 @@
 #define VIRTIO_BLK_F_GEOMETRY  4   /* Legacy geometry available  */
 #define VIRTIO_BLK_F_RO5   /* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE  6   /* Block size of disk is available*/
+#define VIRTIO_BLK_F_IDENTIFY  8   /* ATA IDENTIFY supported */
 
+#define VIRTIO_BLK_ID_BYTES(sizeof (__u16[256]))   /* IDENTIFY DATA */
+
+/* mapped into pci i/o region 0
+ */
 struct virtio_blk_config
 {
/* The capacity (in 512-byte sectors). */
@@ -32,6 +37,7 @@ struct virtio_blk_config
} geometry;
/* block size of device (if VIRTIO_BLK_F_BLK_SIZE) */
__u32 blk_size;
+   __u8 identify[VIRTIO_BLK_ID_BYTES];
 } __attribute__((packed));
 
 /* These two define direction. */

-- 
john.coo...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: configure script bug..

2009-06-01 Thread john cooper
Avi Kivity wrote:
 john cooper wrote:
 Hit this yesterday when configure hung attempting
 to pull the version from a kernel's .config.

   
 
 Looks right, but missing a signoff.


Signed-off-by: john cooper john.coo...@redhat.com

diff --git a/configure b/configure
index 493c178..1fd133c 100755
--- a/configure
+++ b/configure
@@ -126,7 +126,7 @@ if [ -n $no_uname ]; then
 elif [ -e $kerneldir/include/config/kernel.release ]; then
 depmod_version=`cat $kerneldir/include/config/kernel.release`
 elif [ -e $kerneldir/.config ]; then
-   depmod_version=$(awk '/Linux kernel version:/ { print $NF }'
+   depmod_version=$(awk '/Linux kernel version:/ { print $NF }' \
 $kerneldir/.config)
 else
 echo

-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


configure script bug..

2009-05-30 Thread john cooper
Hit this yesterday when configure hung attempting
to pull the version from a kernel's .config.


diff --git a/configure b/configure
index 493c178..1fd133c 100755
--- a/configure
+++ b/configure
@@ -126,7 +126,7 @@ if [ -n $no_uname ]; then
 elif [ -e $kerneldir/include/config/kernel.release ]; then
 depmod_version=`cat $kerneldir/include/config/kernel.release`
 elif [ -e $kerneldir/.config ]; then
-   depmod_version=$(awk '/Linux kernel version:/ { print $NF }'
+   depmod_version=$(awk '/Linux kernel version:/ { print $NF }' \
 $kerneldir/.config)
 else
 echo


-- 
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 2/2] Add serial number support for virtio_blk, V3

2009-05-28 Thread john cooper

 +if (!(id = kzalloc(ATA_ID_WORDS, GFP_KERNEL)))
 +rv = -ENOMEM;
 
 Doesn't ATA_ID_WORDS seem like a strange name for a number of bytes?

Yes I caught that bug in the rework as well.

 What's this *for* BTW?

Sorry -- I assumed you were on either list.
Please see patch to follow.

-john

-- 
john.coo...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 2/2] Add serial number support for virtio_blk, V3

2009-05-28 Thread john cooper
Christoph Hellwig wrote:
 On Wed, May 27, 2009 at 09:49:19AM +0200, Christoph Hellwig wrote:
 /*
  * IDE-compatible identify ioctl.
  *
  * Currenlyt only returns the serial number and leaves all other fields
  * zero.
  */
 
 Btw, thinking about it the rest of the information in the ioctl should
 probably be filled up with faked data, similar to how we do it for
 the ide emulation inside qemu.

Agreed.  I've done so to the extent it makes sense
in the case of the more generic fields.  A fair
amount of the identify information being generated
by hw/ide.c appears obsoleted.

-john

-- 
john.coo...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] Add serial number support for virtio_blk, V4

2009-05-28 Thread john cooper
[Rework of earlier patch to provide additional
information in the response to an ATA identify
request -- virtio_blk treats the data as opaque,
content created by qemu's virtio-blk.  Comments
from Christoph also incorporated.]

This patch allows passing of a virtio_blk drive
serial number from qemu into a guest's virtio_blk
driver, and provides a means to access the serial
number from a guest's userspace.

Equivalent functionality currently exists for IDE
and SCSI, however it is not yet implemented for
virtio.  Scenarios exist where guest code relies
on a unique drive serial number to correctly
identify the machine environment in which it
exists.

The following two patches implement the above:

  qemu-vblk-serial-4.patch

which provides the qemu missing bits to interpret
a '-drive .. serial=XYZ ..' flag, and:

  virtio_blk-serial-4.patch

which extracts this information and makes it
available to guest userspace via an HDIO_GET_IDENTITY
ioctl, eg: 'hdparm -i /dev/vda'.

The above patches are relative to qemu-kvm.git and
2.6.29.3 respectively.

-john

-- 
john.coo...@redhat.com



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Add serial number support for virtio_blk, V4

2009-05-28 Thread john cooper
qemu-vblk-serial-4.patch



diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 8dd3c7a..0b7ebe9 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -25,6 +25,7 @@ typedef struct VirtIOBlock
 BlockDriverState *bs;
 VirtQueue *vq;
 void *rq;
+char serial_str[BLOCK_SERIAL_STRLEN + 1];
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -285,6 +286,50 @@ static void virtio_blk_reset(VirtIODevice *vdev)
 qemu_aio_flush();
 }
 
+/* store identify data in little endian format
+ */
+static inline void put_le16(uint16_t *p, unsigned int v)
+{
+*p = cpu_to_le16(v);
+}
+
+/* copy to *dst from *src, nul pad dst tail as needed to len bytes
+ */
+static inline void padstr(char *dst, const char *src, int len)
+{
+while (len--)
+*dst++ = *src ? *src++ : '\0';
+}
+
+/* setup simulated identify data as appropriate for virtio block device
+ *
+ * ref: AT Attachment 8 - ATA/ATAPI Command Set (ATA8-ACS)
+ */
+static inline void virtio_identify_template(struct virtio_blk_config *bc)
+{
+uint16_t *p = bc-identify[0];
+uint64_t lba_sectors = bc-capacity;
+
+memset(p, 0, sizeof(bc-identify));
+put_le16(p + 0, 0x0);/* ATA device */
+padstr((char *)(p + 23), QEMU_VERSION, 8);   /* firmware revision */
+padstr((char *)(p + 27), QEMU VIRT_BLK, 40);   /* model# */
+put_le16(p + 47, 0x80ff);/* max xfer 255 sectors */
+put_le16(p + 49, 0x0b00);/* support IORDY/LBA/DMA 
*/
+put_le16(p + 59, 0x1ff); /* cur xfer 255 sectors */
+put_le16(p + 80, 0x1f0); /* support ATA8/7/6/5/4 */
+put_le16(p + 81, 0x16);
+put_le16(p + 82, 0x400);
+put_le16(p + 83, 0x400);
+put_le16(p + 100, lba_sectors);
+put_le16(p + 101, lba_sectors  16);
+put_le16(p + 102, lba_sectors  32);
+put_le16(p + 103, lba_sectors  48);
+}
+
+/* coalesce internal state, copy to pci i/o region 0
+ */
+
 static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 {
 VirtIOBlock *s = to_virtio_blk(vdev);
@@ -299,11 +344,15 @@ static void virtio_blk_update_config(VirtIODevice *vdev, 
uint8_t *config)
 stw_raw(blkcfg.cylinders, cylinders);
 blkcfg.heads = heads;
 blkcfg.sectors = secs;
+virtio_identify_template(blkcfg);
+memcpy(blkcfg.identify[VIRTIO_BLK_ID_SN], s-serial_str,
+VIRTIO_BLK_ID_SN_BYTES);
 memcpy(config, blkcfg, sizeof(blkcfg));
 }
 
 static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 {
+VirtIOBlock *s = to_virtio_blk(vdev);
 uint32_t features = 0;
 
 features |= (1  VIRTIO_BLK_F_SEG_MAX);
@@ -311,6 +360,8 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 #ifdef __linux__
 features |= (1  VIRTIO_BLK_F_SCSI);
 #endif
+if (strcmp(s-serial_str, 0))
+features |= 1  VIRTIO_BLK_F_IDENTIFY;
 
 return features;
 }
@@ -354,6 +405,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev)
 int cylinders, heads, secs;
 static int virtio_blk_id;
 BlockDriverState *bs;
+char *ps;
 
 s = (VirtIOBlock *)virtio_common_init(virtio-blk, VIRTIO_ID_BLOCK,
   sizeof(struct virtio_blk_config),
@@ -365,6 +417,10 @@ VirtIODevice *virtio_blk_init(DeviceState *dev)
 s-vdev.reset = virtio_blk_reset;
 s-bs = bs;
 s-rq = NULL;
+if (strlen(ps = (char *)drive_get_serial(bs)))
+strncpy(s-serial_str, ps, sizeof(s-serial_str));
+else
+snprintf(s-serial_str, sizeof(s-serial_str), 0);
 bs-private = dev;
 bdrv_guess_geometry(s-bs, cylinders, heads, secs);
 bdrv_set_geometry_hint(s-bs, cylinders, heads, secs);
diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
index dff3e0c..1be4342 100644
--- a/hw/virtio-blk.h
+++ b/hw/virtio-blk.h
@@ -30,6 +30,11 @@
 #define VIRTIO_BLK_F_RO 5   /* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE   6   /* Block size of disk is available*/
 #define VIRTIO_BLK_F_SCSI   7   /* Supports scsi command passthru */
+#define VIRTIO_BLK_F_IDENTIFY   8   /* ATA IDENTIFY supported */
+
+#define VIRTIO_BLK_ID_LEN   256 /* length of identify u16 array */
+#define VIRTIO_BLK_ID_SN10  /* start of char * serial# */
+#define VIRTIO_BLK_ID_SN_BYTES  20  /* length in bytes of serial# */
 
 struct virtio_blk_config
 {
@@ -39,6 +44,8 @@ struct virtio_blk_config
 uint16_t cylinders;
 uint8_t heads;
 uint8_t sectors;
+uint32_t _blk_size;/* structure pad, currently unused */
+uint16_t identify[VIRTIO_BLK_ID_LEN];
 } __attribute__((packed));
 
 /* These two define direction. */
diff --git a/sysemu.h b/sysemu.h
index 47d001e..d3df19f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -152,6 +152,8 @@ typedef enum {
 BLOCK_ERR_STOP_ANY
 } BlockInterfaceErrorAction;
 
+#define BLOCK_SERIAL_STRLEN 20
+
 typedef struct DriveInfo {
 BlockDriverState *bdrv;
 

[PATCH 2/2] Add serial number support for virtio_blk, V4

2009-05-28 Thread john cooper
virtio_blk-serial-4.patch



 drivers/block/virtio_blk.c |   41 ++---
 include/linux/virtio_blk.h |7 +++
 2 files changed, 45 insertions(+), 3 deletions(-)
=
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -146,12 +146,46 @@ static void do_virtblk_request(struct re
vblk-vq-vq_ops-kick(vblk-vq);
 }
 
+/* return ATA identify data
+ */
+static int virtblk_identify(struct gendisk *disk, void *argp)
+{
+   struct virtio_blk *vblk = disk-private_data;
+   u16 *id;
+   int err = -ENOMEM;
+
+   id = kmalloc(VIRTIO_BLK_ID_BYTES, GFP_KERNEL);
+   if (!id)
+   goto out;
+
+   err = virtio_config_buf(vblk-vdev, VIRTIO_BLK_F_IDENTIFY,
+   offsetof(struct virtio_blk_config, identify), id,
+   VIRTIO_BLK_ID_BYTES);
+
+   if (err)
+   goto out_kfree;
+
+   if (copy_to_user(argp, id, VIRTIO_BLK_ID_BYTES))
+   err = -EFAULT;
+
+out_kfree:
+   kfree(id);
+out:
+   return err;
+}
+
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 unsigned cmd, unsigned long data)
 {
-   return scsi_cmd_ioctl(bdev-bd_disk-queue,
- bdev-bd_disk, mode, cmd,
- (void __user *)data);
+   struct gendisk *disk = bdev-bd_disk;
+   void __user *argp = (void __user *)data;
+
+   switch (cmd) {
+   case HDIO_GET_IDENTITY:
+   return virtblk_identify(disk, argp);
+   default:
+   return scsi_cmd_ioctl(disk-queue, disk, mode, cmd, argp);
+   }
 }
 
 /* We provide getgeo only to please some old bootloader/partitioning tools */
@@ -356,6 +390,7 @@ static struct virtio_device_id id_table[
 static unsigned int features[] = {
VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
+   VIRTIO_BLK_F_IDENTIFY
 };
 
 static struct virtio_driver virtio_blk = {
=
--- a/include/linux/virtio_blk.h
+++ b/include/linux/virtio_blk.h
@@ -15,7 +15,13 @@
 #define VIRTIO_BLK_F_GEOMETRY  4   /* Legacy geometry available  */
 #define VIRTIO_BLK_F_RO5   /* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE  6   /* Block size of disk is available*/
+#define VIRTIO_BLK_F_IDENTIFY  8   /* ATA IDENTIFY supported */
 
+#define VIRTIO_BLK_ID_LEN 256
+#define VIRTIO_BLK_ID_BYTES (VIRTIO_BLK_ID_LEN * sizeof (u16))
+
+/* mapped into pci i/o region 0
+ */
 struct virtio_blk_config
 {
/* The capacity (in 512-byte sectors). */
@@ -32,6 +38,7 @@ struct virtio_blk_config
} geometry;
/* block size of device (if VIRTIO_BLK_F_BLK_SIZE) */
__u32 blk_size;
+   __u16 identify[VIRTIO_BLK_ID_LEN];
 } __attribute__((packed));
 
 /* These two define direction. */
-- 
john.coo...@redhat.com



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Add serial number support for virtio_blk, V2

2009-05-26 Thread john cooper

Christoph Hellwig wrote:

On Mon, May 18, 2009 at 11:00:41AM -0400, john cooper wrote:

Christoph Hellwig wrote:

So why can't we re-use the existing interfaces instead of inventing a
new one?

I'm unclear to what specifically you're referring -- the
ioctl() used to retrieve the serial number in the guest?


Well, there's not specific ioctl to get a serial number for scsi, but
given that we now have SG_IO passthrough in virtio-blk it should be easy
enough to provide inquiry data and the device identification VPD page
by that way.  Not sure how it's handled for ide, maybe that way
is even easier.


Yea, I'm hardly in enamored with the IDE/ATA
diatribe.  But in this case displacing the new
ioctl with HDIO_GET_IDENTITY seemed the most
straightforward means to provide access within
an existing interface.  Updated patch follows.

-john


--
john.coo...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Add serial number support for virtio_blk, V3

2009-05-26 Thread john cooper


--
john.coo...@redhat.com

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index dad4ef0..90825a8 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -25,6 +25,7 @@ typedef struct VirtIOBlock
 BlockDriverState *bs;
 VirtQueue *vq;
 void *rq;
+char serial_str[BLOCK_SERIAL_STRLEN + 1];
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -285,6 +286,8 @@ static void virtio_blk_reset(VirtIODevice *vdev)
 qemu_aio_flush();
 }
 
+/* coalesce internal state, copy to pci i/o region 0
+ */
 static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 {
 VirtIOBlock *s = to_virtio_blk(vdev);
@@ -299,11 +302,13 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 stw_raw(blkcfg.cylinders, cylinders);
 blkcfg.heads = heads;
 blkcfg.sectors = secs;
+memcpy(blkcfg.serial, s-serial_str, sizeof (blkcfg.serial));
 memcpy(config, blkcfg, sizeof(blkcfg));
 }
 
 static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 {
+VirtIOBlock *s = to_virtio_blk(vdev);
 uint32_t features = 0;
 
 features |= (1  VIRTIO_BLK_F_SEG_MAX);
@@ -311,6 +316,8 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 #ifdef __linux__
 features |= (1  VIRTIO_BLK_F_SCSI);
 #endif
+if (strcmp(s-serial_str, 0))
+features |= 1  VIRTIO_BLK_F_SN;
 
 return features;
 }
@@ -353,6 +360,7 @@ void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs)
 VirtIOBlock *s;
 int cylinders, heads, secs;
 static int virtio_blk_id;
+char *ps = drive_get_serial(bs);
 
 s = (VirtIOBlock *)virtio_init_pci(bus, virtio-blk,
PCI_VENDOR_ID_REDHAT_QUMRANET,
@@ -369,6 +377,10 @@ void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs)
 s-vdev.reset = virtio_blk_reset;
 s-bs = bs;
 s-rq = NULL;
+if (strlen(ps))
+strncpy(s-serial_str, ps, sizeof (s-serial_str));
+else
+snprintf(s-serial_str, sizeof (s-serial_str), 0);
 bs-private = s-vdev.pci_dev;
 bdrv_guess_geometry(s-bs, cylinders, heads, secs);
 bdrv_set_geometry_hint(s-bs, cylinders, heads, secs);
diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
index 5ef6c36..3229394 100644
--- a/hw/virtio-blk.h
+++ b/hw/virtio-blk.h
@@ -31,6 +31,7 @@
 #define VIRTIO_BLK_F_RO 5   /* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE   6   /* Block size of disk is available*/
 #define VIRTIO_BLK_F_SCSI   7   /* Supports scsi command passthru */
+#define VIRTIO_BLK_F_SN 8   /* serial number supported */
 
 struct virtio_blk_config
 {
@@ -40,6 +41,8 @@ struct virtio_blk_config
 uint16_t cylinders;
 uint8_t heads;
 uint8_t sectors;
+uint32_t _blk_size;/* structure pad, currently unused */
+uint8_t serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */
diff --git a/sysemu.h b/sysemu.h
index 1f45fd6..185b4e3 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -141,6 +141,8 @@ typedef enum {
 BLOCK_ERR_STOP_ANY
 } BlockInterfaceErrorAction;
 
+#define BLOCK_SERIAL_STRLEN 20
+
 typedef struct DriveInfo {
 BlockDriverState *bdrv;
 BlockInterfaceType type;
@@ -149,7 +151,7 @@ typedef struct DriveInfo {
 int used;
 int drive_opt_idx;
 BlockInterfaceErrorAction onerror;
-char serial[21];
+char serial[BLOCK_SERIAL_STRLEN + 1];
 } DriveInfo;
 
 #define MAX_IDE_DEVS	2


[PATCH 2/2] Add serial number support for virtio_blk, V3

2009-05-26 Thread john cooper


--
john.coo...@redhat.com

 drivers/block/virtio_blk.c |   32 +---
 include/linux/virtio_blk.h |6 ++
 2 files changed, 35 insertions(+), 3 deletions(-)
=
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -146,12 +146,37 @@ static void do_virtblk_request(struct re
 		vblk-vq-vq_ops-kick(vblk-vq);
 }
 
+/* return serial# in IDE identify data (all other fields are currently zero)
+ */
+static int virtblk_identity(struct block_device *bdev, void *buf)
+{
+	struct virtio_blk *vblk = bdev-bd_disk-private_data;
+	u16 *id;
+	int rv;
+
+	if (!(id = kzalloc(ATA_ID_WORDS, GFP_KERNEL)))
+		rv = -ENOMEM;
+	else if ((rv = virtio_config_buf(vblk-vdev, VIRTIO_BLK_F_SN,
+		offsetof(struct virtio_blk_config, serial), id[ATA_ID_SERNO],
+		ATA_ID_SERNO_LEN)))
+			;
+	else if (copy_to_user(buf, id, ATA_ID_WORDS))
+		rv = -EFAULT;
+	else
+		rv = 0;
+	if (id)
+	kfree(id);
+return (rv);
+}
+
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 			 unsigned cmd, unsigned long data)
 {
-	return scsi_cmd_ioctl(bdev-bd_disk-queue,
-			  bdev-bd_disk, mode, cmd,
-			  (void __user *)data);
+	if (cmd == HDIO_GET_IDENTITY)
+		return (virtblk_identity(bdev, (void __user *)data));
+	else
+		return scsi_cmd_ioctl(bdev-bd_disk-queue, bdev-bd_disk,
+			mode, cmd, (void __user *)data);
 }
 
 /* We provide getgeo only to please some old bootloader/partitioning tools */
@@ -356,6 +381,7 @@ static struct virtio_device_id id_table[
 static unsigned int features[] = {
 	VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
 	VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
+	VIRTIO_BLK_F_SN
 };
 
 static struct virtio_driver virtio_blk = {
=
--- a/include/linux/virtio_blk.h
+++ b/include/linux/virtio_blk.h
@@ -15,7 +15,12 @@
 #define VIRTIO_BLK_F_GEOMETRY	4	/* Legacy geometry available  */
 #define VIRTIO_BLK_F_RO		5	/* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE	6	/* Block size of disk is available*/
+#define VIRTIO_BLK_F_SN		8	/* serial number supported */
 
+#define BLOCK_SERIAL_STRLEN	20
+
+/* mapped into pci i/o region 0
+ */
 struct virtio_blk_config
 {
 	/* The capacity (in 512-byte sectors). */
@@ -32,6 +37,7 @@ struct virtio_blk_config
 	} geometry;
 	/* block size of device (if VIRTIO_BLK_F_BLK_SIZE) */
 	__u32 blk_size;
+	__u8 serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */


Re: [PATCH 0/2] Add serial number support for virtio_blk, V2

2009-05-18 Thread john cooper

Christoph Hellwig wrote:

On Wed, May 13, 2009 at 01:06:57PM -0400, john cooper wrote:

[Resend of earlier patch: 1/2 rebased to qemu-kvm,
2/2 minor tweak]


patch 1/2 seems to be missing.

It is in the kvm and qemu-devel list archives:

http://www.spinics.net/lists/kvm/maillist.html
http://lists.gnu.org/archive/html/qemu-devel/2009-05/msg00661.html



Equivalent functionality currently exists for IDE
and SCSI, however it is not yet implemented for
virtio.


So why can't we re-use the existing interfaces instead of inventing a
new one?

I'm unclear to what specifically you're referring -- the
ioctl() used to retrieve the serial number in the guest?

-john

--
john.coo...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] Add serial number support for virtio_blk, V2

2009-05-13 Thread john cooper

[Resend of earlier patch: 1/2 rebased to qemu-kvm,
2/2 minor tweak]

This patch allows passing of a virtio_blk drive
serial number from qemu into a guest's virtio_blk
driver, and provides a means to access the serial
number from a guest's userspace.

Equivalent functionality currently exists for IDE
and SCSI, however it is not yet implemented for
virtio.  Scenarios exist where guest code relies
on a unique drive serial number to correctly
identify the machine environment in which it
exists.

The following two patches implement the above:

   qemu-vblk-serial-2.patch

which provides the qemu missing bits to interpret
a '-drive .. serial=XYZ ..' flag, and:

   virtio_blk-serial-2.patch

which extracts this information and makes it
available to guest userspace via ioctl.

Attached to this patch header is a trivial example
program which retrieves the serial number from
guest userspace.

The above patches are relative to qemu-kvm.git and
2.6.29.3 respectively.

-john

--
john.coo...@redhat.com

/* example: retrieve serial number from virtio block device
 */
#include stdio.h
#include fcntl.h
#include stdlib.h
#include linux/virtio_blk.h

#define iswhite(c)	(!('!' = (c)  (c) = '~'))

#ifndef VBLK_GET_SN
#define VBLK_GET_SN ((unsigned int)('V'  24 | 'B'  16 | 'L'  8 | 'K'))
#endif

/* get virtblk drive serial#
 */
int main(int ac, char ***av)
{
	int fd, nb, i;
	unsigned char sn[30];
	unsigned char *p;

	sn[0] = sizeof (sn);
	if ((fd = open(/dev/vda, O_RDONLY))  0)
		perror(can't open device), exit(1);
	else if ((nb = ioctl(fd, VBLK_GET_SN, sn))  0)
		perror(can't ioctl device), exit(1);
	printf(returned %d bytes:\n, nb);
	for (p = sn, i = nb; 0 = --i; ++p)
		printf(%02x%c, *p, i ? ' ' : '\t');
	for (p = sn, i = nb; 0 = --i; ++p)
		printf(%c%s, iswhite(*p) ? '.' : *p, i ?  : \n);
	return (0);
}


[PATCH 1/2] Add serial number support for virtio_blk, V2

2009-05-13 Thread john cooper


--
john.coo...@redhat.com

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index dad4ef0..90825a8 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -25,6 +25,7 @@ typedef struct VirtIOBlock
 BlockDriverState *bs;
 VirtQueue *vq;
 void *rq;
+char serial_str[BLOCK_SERIAL_STRLEN + 1];
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -285,6 +286,8 @@ static void virtio_blk_reset(VirtIODevice *vdev)
 qemu_aio_flush();
 }
 
+/* coalesce internal state, copy to pci i/o region 0
+ */
 static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 {
 VirtIOBlock *s = to_virtio_blk(vdev);
@@ -299,11 +302,13 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 stw_raw(blkcfg.cylinders, cylinders);
 blkcfg.heads = heads;
 blkcfg.sectors = secs;
+memcpy(blkcfg.serial, s-serial_str, sizeof (blkcfg.serial));
 memcpy(config, blkcfg, sizeof(blkcfg));
 }
 
 static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 {
+VirtIOBlock *s = to_virtio_blk(vdev);
 uint32_t features = 0;
 
 features |= (1  VIRTIO_BLK_F_SEG_MAX);
@@ -311,6 +316,8 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 #ifdef __linux__
 features |= (1  VIRTIO_BLK_F_SCSI);
 #endif
+if (strcmp(s-serial_str, 0))
+features |= 1  VIRTIO_BLK_F_SN;
 
 return features;
 }
@@ -353,6 +360,7 @@ void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs)
 VirtIOBlock *s;
 int cylinders, heads, secs;
 static int virtio_blk_id;
+char *ps = drive_get_serial(bs);
 
 s = (VirtIOBlock *)virtio_init_pci(bus, virtio-blk,
PCI_VENDOR_ID_REDHAT_QUMRANET,
@@ -369,6 +377,10 @@ void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs)
 s-vdev.reset = virtio_blk_reset;
 s-bs = bs;
 s-rq = NULL;
+if (strlen(ps))
+strncpy(s-serial_str, ps, sizeof (s-serial_str));
+else
+snprintf(s-serial_str, sizeof (s-serial_str), 0);
 bs-private = s-vdev.pci_dev;
 bdrv_guess_geometry(s-bs, cylinders, heads, secs);
 bdrv_set_geometry_hint(s-bs, cylinders, heads, secs);
diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
index 5ef6c36..3229394 100644
--- a/hw/virtio-blk.h
+++ b/hw/virtio-blk.h
@@ -31,6 +31,7 @@
 #define VIRTIO_BLK_F_RO 5   /* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE   6   /* Block size of disk is available*/
 #define VIRTIO_BLK_F_SCSI   7   /* Supports scsi command passthru */
+#define VIRTIO_BLK_F_SN 8   /* serial number supported */
 
 struct virtio_blk_config
 {
@@ -40,6 +41,8 @@ struct virtio_blk_config
 uint16_t cylinders;
 uint8_t heads;
 uint8_t sectors;
+uint32_t _blk_size;/* structure pad, currently unused */
+uint8_t serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */
diff --git a/sysemu.h b/sysemu.h
index 1f45fd6..185b4e3 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -141,6 +141,8 @@ typedef enum {
 BLOCK_ERR_STOP_ANY
 } BlockInterfaceErrorAction;
 
+#define BLOCK_SERIAL_STRLEN 20
+
 typedef struct DriveInfo {
 BlockDriverState *bdrv;
 BlockInterfaceType type;
@@ -149,7 +151,7 @@ typedef struct DriveInfo {
 int used;
 int drive_opt_idx;
 BlockInterfaceErrorAction onerror;
-char serial[21];
+char serial[BLOCK_SERIAL_STRLEN + 1];
 } DriveInfo;
 
 #define MAX_IDE_DEVS	2


[PATCH 2/2] Add serial number support for virtio_blk, V2

2009-05-13 Thread john cooper


--
john.coo...@redhat.com

 drivers/block/virtio_blk.c |   35 ---
 include/linux/virtio_blk.h |   10 ++
 2 files changed, 42 insertions(+), 3 deletions(-)
=
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -146,12 +146,40 @@ static void do_virtblk_request(struct re
 		vblk-vq-vq_ops-kick(vblk-vq);
 }
 
+/* user passes the address of a char[] for serial# return, and has set char[0]
+ * to the array size.  copy serial# to this char[] and return number of
+ * characters copied excluding any trailing '\0' pad chars in buffer.
+ */
+static int get_virtblk_sn(struct block_device *bdev, void *buf)
+{
+	struct virtio_blk *vblk = bdev-bd_disk-private_data;
+	unsigned char serial[BLOCK_SERIAL_STRLEN];
+	unsigned char snlen;
+	int rv;
+
+	if (copy_from_user(snlen, buf, sizeof (snlen)))
+		rv = -EFAULT;
+	else if ((rv = virtio_config_val(vblk-vdev, VIRTIO_BLK_F_SN,
+		offsetof(struct virtio_blk_config, serial), serial)))
+			;
+	else if (copy_to_user(buf, serial,
+		snlen = min(snlen, (unsigned char)sizeof (serial
+			rv = -EFAULT;
+	else
+		for (rv = 0; rv  snlen; ++rv)
+			if (!serial[rv])
+break;
+	return (rv);
+}
+
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 			 unsigned cmd, unsigned long data)
 {
-	return scsi_cmd_ioctl(bdev-bd_disk-queue,
-			  bdev-bd_disk, mode, cmd,
-			  (void __user *)data);
+	if (cmd == VBLK_GET_SN)
+		return (get_virtblk_sn(bdev, (void __user *)data));
+	else
+		return scsi_cmd_ioctl(bdev-bd_disk-queue, bdev-bd_disk,
+			mode, cmd, (void __user *)data);
 }
 
 /* We provide getgeo only to please some old bootloader/partitioning tools */
@@ -356,6 +384,7 @@ static struct virtio_device_id id_table[
 static unsigned int features[] = {
 	VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
 	VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
+	VIRTIO_BLK_F_SN
 };
 
 static struct virtio_driver virtio_blk = {
=
--- a/include/linux/virtio_blk.h
+++ b/include/linux/virtio_blk.h
@@ -15,7 +15,16 @@
 #define VIRTIO_BLK_F_GEOMETRY	4	/* Legacy geometry available  */
 #define VIRTIO_BLK_F_RO		5	/* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE	6	/* Block size of disk is available*/
+#define VIRTIO_BLK_F_SN		8	/* serial number supported */
 
+/* ioctl cmd to retrieve serial#
+*/
+#define VBLK_GET_SN ((unsigned int)('V'  24 | 'B'  16 | 'L'  8 | 'K'))
+
+#define BLOCK_SERIAL_STRLEN 20
+
+/* mapped into pci i/o region 0
+ */
 struct virtio_blk_config
 {
 	/* The capacity (in 512-byte sectors). */
@@ -32,6 +41,7 @@ struct virtio_blk_config
 	} geometry;
 	/* block size of device (if VIRTIO_BLK_F_BLK_SIZE) */
 	__u32 blk_size;
+	__u8 serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */


Re: changing guest CD

2009-05-11 Thread john cooper

Stuart Jansen wrote:

Does KVM support changing the CD in a running guest's disc drive? I've
tried to do it using the qemu monitor, but so far haven't been able to.
I've seen rumor and innuendo that KVM can't change the disc in a running
system, but no official confirmation yet.

If KVM doesn't support changing the disc in a running system, what would
be required to support it?



The following worked for me when doing a
guest install from multiple CD iso images:

 to enter monitor from guest screen:

 cntlalt2

 to exit back to guest screen:

 cntlalt1

 changing a cd image:

 (qemu) info block

 list of block devices

 (qemu) eject ide1-cd0
 (qemu) change ide1-cd0 target_file

 cntlalt1


--
john.coo...@third-harmonic.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Add serial number support for virtio_blk

2009-04-29 Thread john cooper


--
john.coo...@third-harmonic.com
 hw/virtio-blk.c |   15 ++-
 hw/virtio-blk.h |3 +++
 sysemu.h|4 +++-
 3 files changed, 20 insertions(+), 2 deletions(-)
=
--- a/qemu/hw/virtio-blk.h
+++ b/qemu/hw/virtio-blk.h
@@ -28,6 +28,7 @@
 #define VIRTIO_BLK_F_SIZE_MAX   1   /* Indicates maximum segment size */
 #define VIRTIO_BLK_F_SEG_MAX2   /* Indicates maximum # of segments */
 #define VIRTIO_BLK_F_GEOMETRY   4   /* Indicates support of legacy geometry */
+#define VIRTIO_BLK_F_SN		7   /* serial number supported */
 
 struct virtio_blk_config
 {
@@ -37,6 +38,8 @@ struct virtio_blk_config
 uint16_t cylinders;
 uint8_t heads;
 uint8_t sectors;
+uint32_t _blk_size;/* structure pad, currently unused */
+uint8_t serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */
=
--- a/qemu/hw/virtio-blk.c
+++ b/qemu/hw/virtio-blk.c
@@ -22,6 +22,7 @@ typedef struct VirtIOBlock
 BlockDriverState *bs;
 VirtQueue *vq;
 void *rq;
+char serial_str[BLOCK_SERIAL_STRLEN + 1];
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -238,6 +239,8 @@ static void virtio_blk_reset(VirtIODevic
 qemu_aio_flush();
 }
 
+/* coalesce internal state, copy to pci i/o region 0
+ */
 static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 {
 VirtIOBlock *s = to_virtio_blk(vdev);
@@ -252,12 +255,17 @@ static void virtio_blk_update_config(Vir
 stw_raw(blkcfg.cylinders, cylinders);
 blkcfg.heads = heads;
 blkcfg.sectors = secs;
+memcpy(blkcfg.serial, s-serial_str, sizeof (blkcfg.serial));
 memcpy(config, blkcfg, sizeof(blkcfg));
 }
 
 static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 {
-return (1  VIRTIO_BLK_F_SEG_MAX | 1  VIRTIO_BLK_F_GEOMETRY);
+VirtIOBlock *s = to_virtio_blk(vdev);
+char ser_set = strcmp(s-serial_str, 0);
+
+return (1  VIRTIO_BLK_F_SEG_MAX | 1  VIRTIO_BLK_F_GEOMETRY |
+(ser_set ? 1  VIRTIO_BLK_F_SN : 0));
 }
 
 static void virtio_blk_save(QEMUFile *f, void *opaque)
@@ -298,6 +306,7 @@ void *virtio_blk_init(PCIBus *bus, Block
 VirtIOBlock *s;
 int cylinders, heads, secs;
 static int virtio_blk_id;
+char *ps = drive_get_serial(bs);
 
 s = (VirtIOBlock *)virtio_init_pci(bus, virtio-blk,
PCI_VENDOR_ID_REDHAT_QUMRANET,
@@ -314,6 +323,10 @@ void *virtio_blk_init(PCIBus *bus, Block
 s-vdev.reset = virtio_blk_reset;
 s-bs = bs;
 s-rq = NULL;
+if (strlen(ps))
+strncpy(s-serial_str, ps, sizeof (s-serial_str));
+else
+snprintf(s-serial_str, sizeof (s-serial_str), 0);
 bs-devfn = s-vdev.pci_dev.devfn;
 bdrv_guess_geometry(s-bs, cylinders, heads, secs);
 bdrv_set_geometry_hint(s-bs, cylinders, heads, secs);
=
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -133,13 +133,15 @@ typedef enum {
 BLOCK_ERR_STOP_ANY
 } BlockInterfaceErrorAction;
 
+#define BLOCK_SERIAL_STRLEN 20
+
 typedef struct DriveInfo {
 BlockDriverState *bdrv;
 BlockInterfaceType type;
 int bus;
 int unit;
 BlockInterfaceErrorAction onerror;
-char serial[21];
+char serial[BLOCK_SERIAL_STRLEN + 1];
 int used;
 int drive_opt_idx;
 } DriveInfo;


[PATCH 2/2] Add serial number support for virtio_blk

2009-04-29 Thread john cooper


--
john.coo...@third-harmonic.com
 drivers/block/virtio_blk.c |   36 +---
 include/linux/virtio_blk.h |   10 ++
 2 files changed, 43 insertions(+), 3 deletions(-)
=
--- a/include/linux/virtio_blk.h
+++ b/include/linux/virtio_blk.h
@@ -14,7 +14,16 @@
 #define VIRTIO_BLK_F_GEOMETRY	4	/* Legacy geometry available  */
 #define VIRTIO_BLK_F_RO		5	/* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE	6	/* Block size of disk is available*/
+#define VIRTIO_BLK_F_SN		7	/* serial number supported */
 
+/* ioctl cmd to retrieve serial#
+*/
+#define VBLK_GET_SN ((unsigned int)('V'  24 | 'B'  16 | 'L'  8 | 'K'))
+
+#define BLOCK_SERIAL_STRLEN 20
+
+/* mapped into pci i/o region 0
+ */
 struct virtio_blk_config
 {
 	/* The capacity (in 512-byte sectors). */
@@ -31,6 +40,7 @@ struct virtio_blk_config
 	} geometry;
 	/* block size of device (if VIRTIO_BLK_F_BLK_SIZE) */
 	__u32 blk_size;
+	__u8 serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */
=
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -146,12 +146,41 @@ static void do_virtblk_request(struct re
 		vblk-vq-vq_ops-kick(vblk-vq);
 }
 
+/* user passes the address of a char[] for serial# return, and has set char[0]
+ * to the array size.  copy serial# to this char[] and return number of
+ * characters copied excluding any trailing '\0' pad chars in buffer.
+ */
+#define _min(a,b)	((a)  (b) ? (a) : (b))
+static int get_virtblk_sn(struct block_device *bdev, void *buf)
+{
+	struct virtio_blk *vblk = bdev-bd_disk-private_data;
+	unsigned char serial[BLOCK_SERIAL_STRLEN];
+	unsigned char snlen;
+	int rv;
+
+	if (copy_from_user(snlen, buf, sizeof (snlen)))
+		rv = -EFAULT;
+	else if ((rv = virtio_config_val(vblk-vdev, VIRTIO_BLK_F_SN,
+		offsetof(struct virtio_blk_config, serial), serial)))
+			;
+	else if (copy_to_user(buf, serial,
+		snlen = min(snlen, (unsigned char)sizeof (serial
+			rv = -EFAULT;
+	else
+		for (rv = 0; rv  snlen; ++rv)
+			if (!serial[rv])
+break;
+	return (rv);
+}
+
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 			 unsigned cmd, unsigned long data)
 {
-	return scsi_cmd_ioctl(bdev-bd_disk-queue,
-			  bdev-bd_disk, mode, cmd,
-			  (void __user *)data);
+	if (cmd == VBLK_GET_SN)
+		return (get_virtblk_sn(bdev, (void __user *)data));
+	else
+		return scsi_cmd_ioctl(bdev-bd_disk-queue, bdev-bd_disk,
+			mode, cmd, (void __user *)data);
 }
 
 /* We provide getgeo only to please some old bootloader/partitioning tools */
@@ -339,6 +368,7 @@ static struct virtio_device_id id_table[
 static unsigned int features[] = {
 	VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
 	VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
+	VIRTIO_BLK_F_SN
 };
 
 static struct virtio_driver virtio_blk = {


[PATCH 0/2] Add serial number support for virtio_blk

2009-04-29 Thread john cooper

This patch allows passing of a virtio_blk drive
serial number from qemu into a guest's virtio_blk
driver, and provides a means to access the serial
number from a guest's userspace.

Equivalent functionality currently exists for IDE
and SCSI, however it is not yet implemented for
virtio.  Scenarios exist where guest code relies
on a unique drive serial number to correctly
identify the machine environment in which it
exists.

The following two patches implement the above

qemu-vblk-serial.patch

which provides the qemu missing bits to interpret
a '-drive .. serial=XYZ ..' flag, and

virtio_blk-serial.patch

which extracts this information and make it
available to guest userspace via ioctl.

Attached to this patch header is a trivial example
program which retrieves the serial number from
guest userspace.

The above patches are relative to kvm-84 and
2.6.28 respectively.

-john


--
john.coo...@third-harmonic.com
/* example: retrieve serial number from virtio block device
 */
#include stdio.h
#include fcntl.h
#include stdlib.h
#include linux/virtio_blk.h

#define iswhite(c)	(!('!' = (c)  (c) = '~'))

#ifndef VBLK_GET_SN
#define VBLK_GET_SN ((unsigned int)('V'  24 | 'B'  16 | 'L'  8 | 'K'))
#endif

/* get virtblk drive serial#
 */
int main(int ac, char ***av)
{
	int fd, nb, i;
	unsigned char sn[30];
	unsigned char *p;

	sn[0] = sizeof (sn);
	if ((fd = open(/dev/vda, O_RDONLY))  0)
		perror(can't open device), exit(1);
	else if ((nb = ioctl(fd, VBLK_GET_SN, sn))  0)
		perror(can't ioctl device), exit(1);
	printf(returned %d bytes:\n, nb);
	for (p = sn, i = nb; 0 = --i; ++p)
		printf(%02x%c, *p, i ? ' ' : '\t');
	for (p = sn, i = nb; 0 = --i; ++p)
		printf(%c%s, iswhite(*p) ? '.' : *p, i ?  : \n);
	return (0);
}


[PATCH 1/2] Add serial number support for virtio_blk

2009-04-29 Thread john cooper


--
john.coo...@third-harmonic.com
 hw/virtio-blk.c |   15 ++-
 hw/virtio-blk.h |3 +++
 sysemu.h|4 +++-
 3 files changed, 20 insertions(+), 2 deletions(-)
=
--- a/qemu/hw/virtio-blk.h
+++ b/qemu/hw/virtio-blk.h
@@ -28,6 +28,7 @@
 #define VIRTIO_BLK_F_SIZE_MAX   1   /* Indicates maximum segment size */
 #define VIRTIO_BLK_F_SEG_MAX2   /* Indicates maximum # of segments */
 #define VIRTIO_BLK_F_GEOMETRY   4   /* Indicates support of legacy geometry */
+#define VIRTIO_BLK_F_SN		7   /* serial number supported */
 
 struct virtio_blk_config
 {
@@ -37,6 +38,8 @@ struct virtio_blk_config
 uint16_t cylinders;
 uint8_t heads;
 uint8_t sectors;
+uint32_t _blk_size;/* structure pad, currently unused */
+uint8_t serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */
=
--- a/qemu/hw/virtio-blk.c
+++ b/qemu/hw/virtio-blk.c
@@ -22,6 +22,7 @@ typedef struct VirtIOBlock
 BlockDriverState *bs;
 VirtQueue *vq;
 void *rq;
+char serial_str[BLOCK_SERIAL_STRLEN + 1];
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -238,6 +239,8 @@ static void virtio_blk_reset(VirtIODevic
 qemu_aio_flush();
 }
 
+/* coalesce internal state, copy to pci i/o region 0
+ */
 static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 {
 VirtIOBlock *s = to_virtio_blk(vdev);
@@ -252,12 +255,17 @@ static void virtio_blk_update_config(Vir
 stw_raw(blkcfg.cylinders, cylinders);
 blkcfg.heads = heads;
 blkcfg.sectors = secs;
+memcpy(blkcfg.serial, s-serial_str, sizeof (blkcfg.serial));
 memcpy(config, blkcfg, sizeof(blkcfg));
 }
 
 static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 {
-return (1  VIRTIO_BLK_F_SEG_MAX | 1  VIRTIO_BLK_F_GEOMETRY);
+VirtIOBlock *s = to_virtio_blk(vdev);
+char ser_set = strcmp(s-serial_str, 0);
+
+return (1  VIRTIO_BLK_F_SEG_MAX | 1  VIRTIO_BLK_F_GEOMETRY |
+(ser_set ? 1  VIRTIO_BLK_F_SN : 0));
 }
 
 static void virtio_blk_save(QEMUFile *f, void *opaque)
@@ -298,6 +306,7 @@ void *virtio_blk_init(PCIBus *bus, Block
 VirtIOBlock *s;
 int cylinders, heads, secs;
 static int virtio_blk_id;
+char *ps = drive_get_serial(bs);
 
 s = (VirtIOBlock *)virtio_init_pci(bus, virtio-blk,
PCI_VENDOR_ID_REDHAT_QUMRANET,
@@ -314,6 +323,10 @@ void *virtio_blk_init(PCIBus *bus, Block
 s-vdev.reset = virtio_blk_reset;
 s-bs = bs;
 s-rq = NULL;
+if (strlen(ps))
+strncpy(s-serial_str, ps, sizeof (s-serial_str));
+else
+snprintf(s-serial_str, sizeof (s-serial_str), 0);
 bs-devfn = s-vdev.pci_dev.devfn;
 bdrv_guess_geometry(s-bs, cylinders, heads, secs);
 bdrv_set_geometry_hint(s-bs, cylinders, heads, secs);
=
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -133,13 +133,15 @@ typedef enum {
 BLOCK_ERR_STOP_ANY
 } BlockInterfaceErrorAction;
 
+#define BLOCK_SERIAL_STRLEN 20
+
 typedef struct DriveInfo {
 BlockDriverState *bdrv;
 BlockInterfaceType type;
 int bus;
 int unit;
 BlockInterfaceErrorAction onerror;
-char serial[21];
+char serial[BLOCK_SERIAL_STRLEN + 1];
 int used;
 int drive_opt_idx;
 } DriveInfo;


[PATCH 2/2] Add serial number support for virtio_blk

2009-04-29 Thread john cooper


--
john.coo...@third-harmonic.com
 drivers/block/virtio_blk.c |   36 +---
 include/linux/virtio_blk.h |   10 ++
 2 files changed, 43 insertions(+), 3 deletions(-)
=
--- a/include/linux/virtio_blk.h
+++ b/include/linux/virtio_blk.h
@@ -14,7 +14,16 @@
 #define VIRTIO_BLK_F_GEOMETRY	4	/* Legacy geometry available  */
 #define VIRTIO_BLK_F_RO		5	/* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE	6	/* Block size of disk is available*/
+#define VIRTIO_BLK_F_SN		7	/* serial number supported */
 
+/* ioctl cmd to retrieve serial#
+*/
+#define VBLK_GET_SN ((unsigned int)('V'  24 | 'B'  16 | 'L'  8 | 'K'))
+
+#define BLOCK_SERIAL_STRLEN 20
+
+/* mapped into pci i/o region 0
+ */
 struct virtio_blk_config
 {
 	/* The capacity (in 512-byte sectors). */
@@ -31,6 +40,7 @@ struct virtio_blk_config
 	} geometry;
 	/* block size of device (if VIRTIO_BLK_F_BLK_SIZE) */
 	__u32 blk_size;
+	__u8 serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */
=
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -146,12 +146,41 @@ static void do_virtblk_request(struct re
 		vblk-vq-vq_ops-kick(vblk-vq);
 }
 
+/* user passes the address of a char[] for serial# return, and has set char[0]
+ * to the array size.  copy serial# to this char[] and return number of
+ * characters copied excluding any trailing '\0' pad chars in buffer.
+ */
+#define _min(a,b)	((a)  (b) ? (a) : (b))
+static int get_virtblk_sn(struct block_device *bdev, void *buf)
+{
+	struct virtio_blk *vblk = bdev-bd_disk-private_data;
+	unsigned char serial[BLOCK_SERIAL_STRLEN];
+	unsigned char snlen;
+	int rv;
+
+	if (copy_from_user(snlen, buf, sizeof (snlen)))
+		rv = -EFAULT;
+	else if ((rv = virtio_config_val(vblk-vdev, VIRTIO_BLK_F_SN,
+		offsetof(struct virtio_blk_config, serial), serial)))
+			;
+	else if (copy_to_user(buf, serial,
+		snlen = min(snlen, (unsigned char)sizeof (serial
+			rv = -EFAULT;
+	else
+		for (rv = 0; rv  snlen; ++rv)
+			if (!serial[rv])
+break;
+	return (rv);
+}
+
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 			 unsigned cmd, unsigned long data)
 {
-	return scsi_cmd_ioctl(bdev-bd_disk-queue,
-			  bdev-bd_disk, mode, cmd,
-			  (void __user *)data);
+	if (cmd == VBLK_GET_SN)
+		return (get_virtblk_sn(bdev, (void __user *)data));
+	else
+		return scsi_cmd_ioctl(bdev-bd_disk-queue, bdev-bd_disk,
+			mode, cmd, (void __user *)data);
 }
 
 /* We provide getgeo only to please some old bootloader/partitioning tools */
@@ -339,6 +368,7 @@ static struct virtio_device_id id_table[
 static unsigned int features[] = {
 	VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
 	VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
+	VIRTIO_BLK_F_SN
 };
 
 static struct virtio_driver virtio_blk = {


Re: bad virtio disk performance

2009-04-27 Thread john cooper

Lucas Nussbaum wrote:

Hi,

I'm experiencing bad disk I/O performance using virtio disks.

I'm using Linux 2.6.29 (host  guest), kvm 84 userspace.
On the host, and in a non-virtio guest, I get ~120 MB/s when writing
with dd (the disks are fast RAID0 SAS disks).


Could you provide detail of the exact type and size
of i/o load you were creating with dd?

Also the full qemu cmd line invocation in both
cases would be useful.


In a guest with a virtio disk, I get at most ~32 MB/s.


Which non-virtio interface was used for the
comparison?


The rest of the setup is the same. For reference, I'm running kvm -drive
file=/tmp/debian-amd64.img,if=virtio.

Is such performance expected? What should I check?


Not expected, something is awry.

blktrace(8) run on the host will shed some light
on the type of i/o requests issued by qemu in both
cases.

-john


--
john.coo...@third-harmonic.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bad virtio disk performance

2009-04-27 Thread john cooper

Lucas Nussbaum wrote:

On 27/04/09 at 13:36 -0400, john cooper wrote:

Lucas Nussbaum wrote:


non-virtio:
kvm -drive file=/tmp/debian-amd64.img,if=scsi,cache=writethrough -net
nic,macaddr=00:16:3e:5a:28:1,model=e1000 -net tap -nographic -kernel
/boot/vmlinuz-2.6.29 -initrd /boot/initrd.img-2.6.29 -append
root=/dev/sda1 ro console=tty0 console=ttyS0,9600,8n1

virtio:
kvm -drive file=/tmp/debian-amd64.img,if=virtio,cache=writethrough -net
nic,macaddr=00:16:3e:5a:28:1,model=e1000 -net tap -nographic -kernel
/boot/vmlinuz-2.6.29 -initrd /boot/initrd.img-2.6.29 -append
root=/dev/vda1 ro console=tty0 console=ttyS0,9600,8n1


One suggestion would be to use a separate drive
for the virtio vs. non-virtio comparison to avoid
a Heisenberg effect.



So, apparently, with virtio, there's a lot more data being written to
disk. The underlying filesystem is ext3, and is monted as /tmp. It only
contains the VM image file. Another difference is that, with virtio, the
data was shared equally over all 4 CPUs, with without virt-io, CPU0 and
CPU1 did all the work.
In the virtio log, I also see a (null) process doing a lot of writes.

Can't say what is causing that -- only took a look
at the short logs. However the isolation suggested
above may help factor that out if you need to
pursue this path.


I uploaded the logs to http://blop.info/bazaar/virtio/, if you want to
take a look.

In the virtio case i/o is being issued from multiple
threads. You could be hitting the cfq close-cooperator
bug we saw as late as 2.6.28.

A quick test to rule this in/out would be to change
the block scheduler to other than cfq for the host
device where the backing image resides -- in your
case the host device containing /tmp/debian-amd64.img.

Eg for /dev/sda1:

# cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
# echo deadline  /sys/block/sda/queue/scheduler
# cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq


And re-run your test to see if this brings
virtio vs. non-virtio closer to the expected
performance.

-john

--
john.coo...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] mm/memory.c:unmap_vmas(): fix NULL * deref

2009-03-23 Thread john cooper

This cropped up in stress testing of a backport
of the mmu notifier mechanism, however it still
exists in 2.6.28.8 as well.  Patch attached.

Signed-off-by: john.coo...@redhat.com

--
john.coo...@third-harmonic.com
 mm/memory.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
=
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -899,9 +899,10 @@ unsigned long unmap_vmas(struct mmu_gath
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details-i_mmap_lock: NULL;
 	int fullmm = (*tlbp)-fullmm;
-	struct mm_struct *mm = vma-vm_mm;
+	struct mm_struct *mm = vma ? vma-vm_mm : NULL;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	if (mm)
+		mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
 	for ( ; vma  vma-vm_start  end_addr; vma = vma-vm_next) {
 		unsigned long end;
 
@@ -966,7 +967,8 @@ unsigned long unmap_vmas(struct mmu_gath
 		}
 	}
 out:
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	if (mm)
+		mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 	return start;	/* which is now the end (or restart) address */
 }
 


Re: Resend: patch: qemu + hugetlbfs..

2009-01-23 Thread john cooper

Avi Kivity wrote:

john cooper wrote:

This trivial patch never did manage to find its way
in.  Marcelo called it to my attention earlier in
the week.  I've tweaked it to apply to kvm-83 and
the resulting patch is attached.  I've left the
prior e-mail discussion below for reference.



Sorry for missing this (copying me helps).  Please resubmit with a 
signoff.Also, please protect with #ifdef MAP_POPULATE to ease merging 
into upstream eventually.

Updated patch attached.

-john


Signed-off-by: john cooper john.coo...@redhat.com

--
john.coo...@redhat.com

 kernel/x86/Kbuild |4 ++--
 qemu/vl.c |   38 ++
 2 files changed, 36 insertions(+), 6 deletions(-)
=
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -237,6 +237,9 @@ int semihosting_enabled = 0;
 int time_drift_fix = 0;
 unsigned int kvm_shadow_memory = 0;
 const char *mem_path = NULL;
+#ifdef MAP_POPULATE
+int mem_prealloc = 1;  /* force preallocation of physical target memory */
+#endif
 int hpagesize = 0;
 const char *cpu_vendor_string;
 #ifdef TARGET_ARM
@@ -4116,7 +4119,12 @@ static void help(int exitcode)
 #endif
-tdfinject timer interrupts that got lost\n
-kvm-shadow-memory megs set the amount of shadow pages to be 
allocated\n
-   -mem-path   set the path to hugetlbfs/tmpfs mounted directory, 
also enables allocation of guest memory with huge pages\n
+   -mem-path   set the path to hugetlbfs/tmpfs mounted directory, 
also\n
+   enables allocation of guest memory with huge 
pages\n
+#ifdef MAP_POPULATE
+   -mem-prealloc   toggles preallocation of -mem-path backed physical 
memory\n
+   at startup.  Default is enabled.\n
+#endif
   -option-rom rom load a file, rom, into the option ROM space\n
 #ifdef TARGET_SPARC
-prom-env variable=value  set OpenBIOS nvram variables\n
@@ -4246,6 +4254,9 @@ enum {
 QEMU_OPTION_tdf,
 QEMU_OPTION_kvm_shadow_memory,
 QEMU_OPTION_mempath,
+#ifdef MAP_POPULATE
+QEMU_OPTION_mem_prealloc,
+#endif
 };
 
 typedef struct QEMUOption {
@@ -4381,6 +4392,9 @@ static const QEMUOption qemu_options[] =
 { icount, HAS_ARG, QEMU_OPTION_icount },
 { incoming, HAS_ARG, QEMU_OPTION_incoming },
 { mem-path, HAS_ARG, QEMU_OPTION_mempath },
+#ifdef MAP_POPULATE
+{ mem-prealloc, 0, QEMU_OPTION_mem_prealloc },
+#endif
 { NULL },
 };
 
@@ -4663,6 +4677,9 @@ void *alloc_mem_area(size_t memory, unsi
 char *filename;
 void *area;
 int fd;
+#ifdef MAP_POPULATE
+int flags;
+#endif
 
 if (asprintf(filename, %s/kvm.XX, path) == -1)
return NULL;
@@ -4690,13 +4707,21 @@ void *alloc_mem_area(size_t memory, unsi
  */
 ftruncate(fd, memory);
 
+#ifdef MAP_POPULATE
+/* NB: MAP_POPULATE won't exhaustively alloc all phys pages in the case
+ * MAP_PRIVATE is requested.  For mem_prealloc we mmap as MAP_SHARED
+ * to sidestep this quirk.
+ */
+flags = mem_prealloc ? MAP_POPULATE|MAP_SHARED : MAP_PRIVATE;
+area = mmap(0, memory, PROT_READ|PROT_WRITE, flags, fd, 0);
+#else
 area = mmap(0, memory, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
+#endif
 if (area == MAP_FAILED) {
-   perror(mmap);
+   perror(alloc_mem_area: can't mmap hugetlbfs pages);
close(fd);
-   return NULL;
+   return (NULL);
 }
-
 *len = memory;
 return area;
 }
@@ -5377,6 +5402,11 @@ int main(int argc, char **argv, char **e
 case QEMU_OPTION_mempath:
mem_path = optarg;
break;
+#ifdef MAP_POPULATE
+case QEMU_OPTION_mem_prealloc:
+   mem_prealloc = !mem_prealloc;
+   break;
+#endif
 case QEMU_OPTION_name:
 qemu_name = optarg;
 break;
=
--- a/kernel/x86/Kbuild
+++ b/kernel/x86/Kbuild
@@ -9,8 +9,8 @@ kvm-objs := kvm_main.o x86.o mmu.o x86_e
 ifeq ($(EXT_CONFIG_KVM_TRACE),y)
 kvm-objs += kvm_trace.o
 endif
-ifeq ($(CONFIG_DMAR),y)
-kvm-objs += vtd.o
+ifeq ($(CONFIG_IOMMU_API),y)
+kvm-objs += iommu.o
 endif
 kvm-intel-objs := vmx.o vmx-debug.o ../external-module-compat.o
 kvm-amd-objs := svm.o ../external-module-compat.o


Re: Resend: patch: qemu + hugetlbfs..

2008-08-26 Thread john cooper

Avi Kivity wrote:

john cooper wrote:

This patch from over a month ago doesn't seem to have
made it into kvm-73 and may have been lost in the
shuffle.  Attached is essentially the same patch but
as applied to kvm-73, and validated relative to that
version.



What is the motivation for providing an option to disable this?  If we 
can detect mem-path is backed by huge pages somehow, I think we can 
prefault the memory unconditionally.



Pre-allocation of the entire huge page backed guest
memory avoids the nondeterministic termination but
admittedly is overly pessimistic.  As this patch does
so by default when -mem-path is specified, allowing
for disable of pre-allocation simply reverts this
change to prior behavior for use cases more tolerant
to it as well as for debug purposes.

The real fix arguably hinges on huge pages having
more general virtual memory behavior.  But that
appears to be a much longer term prospect.

-john

--
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Resend: patch: qemu + hugetlbfs..

2008-08-25 Thread john cooper

This patch from over a month ago doesn't seem to have
made it into kvm-73 and may have been lost in the
shuffle.  Attached is essentially the same patch but
as applied to kvm-73, and validated relative to that
version.

In a nutshell the intention here is to allow
preallocation of guest huge page backed memory at
qemu initialization time to avoid a quirk in the
kernel's huge page accounting allowing overcommit
of huge pages.  Failure of the kernel to resolve a
guest fault to overcommitted huge page memory during
runtime results in sigkill termination of the guest.
This patch provides the option of avoiding such
behavior at the cost of up-front preallocation of
physical huge pages backing the guest.

-john


Anthony Liguori wrote:

john cooper wrote:

Anthony Liguori wrote:

john cooper wrote:

As it currently exists alloc_hpage_mem() is tied to
the notion of huge page allocation as it will reference
gethugepagesize() irrespective of *mem_path.  So even
in the case of tmpfs backed files, if the host kernel
has been configured with CONFIG_HUGETLBFS we will wind
up doing allocations of /dev/shm mapped files at
/proc/meminfo:Hugepagesize granularity.


Which is fine.  It just means we round -m values up to even numbers.


Well, yes it will round the allocation.  But from a
minimally sufficient 4KB boundary to that of 4MB/2MB
relative to a 32/64 bit x86 host which is excessive.


Probably not what was intended but probably not too
much of a concern as -mem-path /dev/shm is likely
only used in debug of this flag and associated logic.
I don't see it currently being worth the trouble to
correct from a squeaky clean POV, and doing so may
drag in far more than the header file we've just
booted above to deal with this architecture/config
dependency.


Renaming a function to a name that's less accurate seems bad to me.  
I don't mean to be pedantic, but it seems like a strange thing to 
do.  I prefer it the way it was before.


I don't see any harm reverting the name.  But I do
believe it is largely cosmetic as given the above,
the current code does require some work to make it
independent of huge page assumptions.  Update attached.

-john


Looks good to me.

Acked-by: Anthony Liguori [EMAIL PROTECTED]

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
[EMAIL PROTECTED]
 vl.c |   48 ++--
 1 file changed, 30 insertions(+), 18 deletions(-)
=
--- ./qemu/vl.c.PAORG
+++ ./qemu/vl.c
@@ -239,6 +239,7 @@ int autostart = 1;
 int time_drift_fix = 0;
 unsigned int kvm_shadow_memory = 0;
 const char *mem_path = NULL;
+int mem_prealloc = 1;  /* force preallocation of physical target memory */
 int hpagesize = 0;
 const char *cpu_vendor_string;
 #ifdef TARGET_ARM
@@ -8423,7 +8424,10 @@ static void help(int exitcode)
 #endif
-tdfinject timer interrupts that got lost\n
-kvm-shadow-memory megs set the amount of shadow pages to be 
allocated\n
-   -mem-path   set the path to hugetlbfs/tmpfs mounted directory, 
also enables allocation of guest memory with huge pages\n
+   -mem-path   set the path to hugetlbfs/tmpfs mounted directory, 
also\n
+   enables allocation of guest memory with huge 
pages\n
+   -mem-prealloc   toggles preallocation of -mem-path backed physical 
memory\n
+   at startup.  Default is enabled.\n
   -option-rom rom load a file, rom, into the option ROM space\n
 #ifdef TARGET_SPARC
-prom-env variable=value  set OpenBIOS nvram variables\n
@@ -8546,6 +8550,7 @@ enum {
 QEMU_OPTION_tdf,
 QEMU_OPTION_kvm_shadow_memory,
 QEMU_OPTION_mempath,
+QEMU_OPTION_mem_prealloc
 };
 
 typedef struct QEMUOption {
@@ -8671,6 +8676,7 @@ const QEMUOption qemu_options[] = {
 { tb-size, HAS_ARG, QEMU_OPTION_tb_size },
 { icount, HAS_ARG, QEMU_OPTION_icount },
 { mem-path, HAS_ARG, QEMU_OPTION_mempath },
+{ mem-prealloc, 0, QEMU_OPTION_mem_prealloc },
 { NULL },
 };
 
@@ -8890,11 +8896,13 @@ static int gethugepagesize(void)
 return hugepagesize;
 }
 
+/* attempt to allocate memory mmap'ed to mem_path
+ */
 void *alloc_mem_area(unsigned long memory, const char *path)
 {
 char *filename;
 void *area;
-int fd;
+int fd, flags;
 
 if (asprintf(filename, %s/kvm.XX, path) == -1)
return NULL;
@@ -8922,26 +8930,27 @@ void *alloc_mem_area(unsigned long memor
  */
 ftruncate(fd, memory);
 
-area = mmap(0, memory, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
-if (area == MAP_FAILED) {
-   perror(mmap);
-   close(fd);
-   return NULL;
-}
-
-return area;
+/* NB: MAP_POPULATE won't exhaustively alloc all phys pages in the case
+ * MAP_PRIVATE is requested

Re: Un-googlable

2008-08-17 Thread john cooper

David Abrahams wrote:

on Sat Aug 16 2008, Glauber Costa glommer-AT-gmail.com wrote:

Furthermore, we're not the first word with (at least) double meaning
in the world.


No, but as I said, the target audience (and application area) overlaps
so heavily that even adding other relevant search terms usually doesn't
eliminate the chaff.


I seem to recall from back when some half-baked
software known as windows which was introduced
by that name despite existing software at the
time using the same moniker.  Doesn't appear to
have impacted its viral propagation in the world
too much.

Give it some time.  I think it is a safe bet for
the hypervisor by that name to be more ubiquitous
than the like named console multiplexer box.

-john

--
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: patch: qemu + hugetlbfs..

2008-07-10 Thread john cooper

Anthony Liguori wrote:


+#include asm/param.h
  


I don't think this is necessary anymore.  Depending on a Linux headers 
breaks the QEMU build on other unices so it's a bad thing.


It is no longer required, but see below.

hpage is a misnomer too as we aren't actually dependent on huge pages (this 
code should work equally well for tmpfs).


As it currently exists alloc_hpage_mem() is tied to
the notion of huge page allocation as it will reference
gethugepagesize() irrespective of *mem_path.  So even
in the case of tmpfs backed files, if the host kernel
has been configured with CONFIG_HUGETLBFS we will wind
up doing allocations of /dev/shm mapped files at
/proc/meminfo:Hugepagesize granularity.  Otherwise if
HUGETLBFS is not configured gethugepagesize() returns
zero and alloc_hpage_mem() itself will not perform the
allocation.

Probably not what was intended but probably not too
much of a concern as -mem-path /dev/shm is likely
only used in debug of this flag and associated logic.
I don't see it currently being worth the trouble to
correct from a squeaky clean POV, and doing so may
drag in far more than the header file we've just
booted above to deal with this architecture/config
dependency.

An updated patch is attached.

-john

--
[EMAIL PROTECTED]
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -234,6 +234,7 @@ int autostart = 1;
 int time_drift_fix = 0;
 unsigned int kvm_shadow_memory = 0;
 const char *mem_path = NULL;
+int mem_prealloc = 1;  /* force preallocation of physical target memory */
 int hpagesize = 0;
 const char *cpu_vendor_string;
 #ifdef TARGET_ARM
@@ -7809,7 +7810,10 @@ static void help(int exitcode)
 #endif
-tdfinject timer interrupts that got lost\n
-kvm-shadow-memory megs set the amount of shadow pages to be 
allocated\n
-   -mem-path   set the path to hugetlbfs/tmpfs mounted directory, 
also enables allocation of guest memory with huge pages\n
+   -mem-path   set the path to hugetlbfs/tmpfs mounted directory, 
also\n
+   enables allocation of guest memory with huge 
pages\n
+   -mem-prealloc   toggles preallocation of -mem-path backed physical 
memory\n
+   at startup.  Default is enabled.\n
   -option-rom rom load a file, rom, into the option ROM space\n
 #ifdef TARGET_SPARC
-prom-env variable=value  set OpenBIOS nvram variables\n
@@ -7932,6 +7936,7 @@ enum {
 QEMU_OPTION_tdf,
 QEMU_OPTION_kvm_shadow_memory,
 QEMU_OPTION_mempath,
+QEMU_OPTION_mem_prealloc
 };
 
 typedef struct QEMUOption {
@@ -8059,6 +8064,7 @@ const QEMUOption qemu_options[] = {
 { startdate, HAS_ARG, QEMU_OPTION_startdate },
 { tb-size, HAS_ARG, QEMU_OPTION_tb_size },
 { mem-path, HAS_ARG, QEMU_OPTION_mempath },
+{ mem-prealloc, 0, QEMU_OPTION_mem_prealloc },
 { NULL },
 };
 
@@ -8276,11 +8282,13 @@ static int gethugepagesize(void)
 return hugepagesize;
 }
 
-void *alloc_mem_area(unsigned long memory, const char *path)
+/* attempt to allocate memory mmap'ed to mem_path
+ */
+void *alloc_hpage_mem(unsigned long memory, const char *path)
 {
 char *filename;
 void *area;
-int fd;
+int fd, flags;
 
 if (asprintf(filename, %s/kvm.XX, path) == -1)
return NULL;
@@ -8308,26 +8316,27 @@ void *alloc_mem_area(unsigned long memor
  */
 ftruncate(fd, memory);
 
-area = mmap(0, memory, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
-if (area == MAP_FAILED) {
-   perror(mmap);
-   close(fd);
-   return NULL;
-}
-
-return area;
+/* NB: MAP_POPULATE won't exhaustively alloc all phys pages in the case
+ * MAP_PRIVATE is requested.  For mem_prealloc we mmap as MAP_SHARED
+ * to sidestep this quirk.
+ */
+flags = mem_prealloc ? MAP_POPULATE|MAP_SHARED : MAP_PRIVATE;
+area = mmap(0, memory, PROT_READ|PROT_WRITE, flags, fd, 0);
+if (area != MAP_FAILED)
+   return (area);
+perror(alloc_hpage_mem: can't mmap hugetlbfs pages);
+close(fd);
+return (NULL);
 }
 
-void *qemu_alloc_physram(unsigned long memory)
+/* allocate guest memory as requested
+ */
+void *qemu_alloc_physram(unsigned long size)
 {
-void *area = NULL;
-
 if (mem_path)
-   area = alloc_mem_area(memory, mem_path);
-if (!area)
-   area = qemu_vmalloc(memory);
-
-return area;
+   return (alloc_hpage_mem(size, mem_path));
+else
+   return (qemu_vmalloc(size));
 }
 
 int main(int argc, char **argv)
@@ -8962,6 +8971,9 @@ int main(int argc, char **argv)
 case QEMU_OPTION_mempath:
mem_path = optarg;
break;
+case QEMU_OPTION_mem_prealloc:
+   mem_prealloc = !mem_prealloc;
+   break;
 case QEMU_OPTION_name:
 qemu_name = optarg;
 break;


Re: patch: qemu + hugetlbfs..

2008-07-08 Thread john cooper

Anthony Liguori wrote:


+/* assertion checking
+ */
+#define ASSERT(c)   if (!(c)) _assert(#c, __FILE__, __LINE__)
+
+static inline void _assert(char *text, char *file, int line) {
+fprintf(stderr, ASSERTION FAILED: [%s] %s:%d\n, text, file, line);
+kill(getpid(), 12); /* dump core */
+}
+
  


Is it really necessary to add this as part of this patch?


No, certainly not strictly necessary.


+int mem_fallback = {0};/* allow alloc from alternate page size */
  
This is a bizarre way to initialize an integer.  I had no idea you could 
even do this for a scalar.  Just initialize to 1 and 0.


That's historically been valid syntax since pre-KR.  But certainly
can be adapted to the context style if it raises concern.


-void *alloc_mem_area(unsigned long memory, const char *path)
+/* fault in associated page, fail syscall when free page is unavailable
+ */ +static inline int fault_in(void *a)
+{
+return (gettimeofday(a, NULL)  0 ? errno != EFAULT : 1);
+}
  


I don't like this very much.  Why not just MAP_POPULATE?


I like it even less.  MAP_POPULATE does not fault in physical
hpages to the map.  Again this was a qemu-side interim bandaid.

+/* we failed to fault in hpage *a, fall back to conventional page 
mapping

+ */
+int remap_hpage(void *a, int sz)
+{
+ASSERT(!(sz  (EXEC_PAGESIZE - 1)));
+if (munmap(a, sz)  0)
+perror(remap_hpage: munmap);
+else if (mmap(a, sz, PROT_READ|PROT_WRITE,
+MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0) == MAP_FAILED)
+perror(remap_hpage: mmap);
+else
+return (1);
+return (0);
+}
  


I think this would be simplified with MAP_POPULATE since you can fail in 
large chunks of memory instead of potentially having a highly fragmented 
set of VMAs.


Here for 4K pages we only need to setup the map.  If we later
fault on a physically absent 4K page we'll wait if a page isn't
immediately available.  Rather in the case of a hpage being
unavailable, we'll terminate.  Note at this point we've effectively
locked onto whatever hpages we've been able to map as they can't
be reclaimed from us until we exit.

Also it appears tab formatting in this patch may need to be
scrubbed some as the mailers seem to be jostling the whitespace.


+int bool_arg(const char *optarg, const char *flag_name)
+{
+if (!optarg)
+;
+else if (altmatch(y|yes|1, optarg))
+return (1);
+else if (altmatch(n|no|0, optarg))
+return (0);
+fprintf(stderr, usage: %s [0|1]\n, flag_name);
+exit(1);
 }
+   


This is introducing too much infrastructure.  Just follow the 
conventions in the rest of the code.  No need to make the options take a 
boolean argument.  The options themselves are boolean arguments.


That's how it originally existed.  However with the defaults
different for the two flags it seemed a bit clunky to have
one recall what the initial disposition of the option was and
to toggle its behavior if that wasn't the intention.

But admittedly I don't have a strong affinity either way.  I
was only concerned with usability and clarity of the flags.

-john



--
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html