Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-26 Thread Avi Kivity

On 05/25/2010 05:02 PM, Anthony Liguori wrote:

On 05/25/2010 08:57 AM, Avi Kivity wrote:

On 05/25/2010 04:54 PM, Anthony Liguori wrote:

On 05/25/2010 08:36 AM, Avi Kivity wrote:


We'd need a kernel-level generic snapshot API for this eventually.

or (2) implement BUSE to complement FUSE and CUSE to enable proper 
userspace block devices.


Likely slow due to lots of copying.  Also needs a snapshot API.


The kernel could use splice.


Still can't make guest memory appear in (A)BUSE process memory 
without either mmu tricks (vmsplice in reverse) or a copy.  May be 
workable for an (A)BUSE driver that talks over a network, and thus 
can splice() its way out.


splice() actually takes offset parameter so it may be possible to 
treat that offset parameter as a file offset.  That would essentially 
allow you to implement a splice() based thread pool where splice() 
replaces preadv/pwritev.


Right.

(note: need splicev() here)
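
For illustration, a minimal sketch of such a splice()-based write path, assuming a hypothetical worker that drains request data from a pipe into an image file at a caller-supplied offset (splice_write and its arguments are invented for this example, not qemu code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical worker: move 'len' bytes from a pipe into the image file
 * at 'offset', letting splice() stand in for pwritev(). */
static int splice_write(int pipe_fd, int img_fd, off_t offset, size_t len)
{
    loff_t off = offset;            /* advanced by the kernel */

    while (len > 0) {
        ssize_t n = splice(pipe_fd, NULL, img_fd, &off, len,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        if (n == 0)
            break;                  /* peer closed the pipe early */
        len -= n;
    }
    return 0;
}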



It's not quite linux-aio, but it should take you pretty far.   I think 
the main point is that the problem of allowing block plugins to qemu 
is the same as block plugins for the kernel.  The kernel doesn't 
provide a stable interface (and we probably can't for the same 
reasons) and it's generally discouraged from a code quality perspective.


The kernel does provide a stable interface for FUSE, and it could 
provide a stable interface for ABUSE.  Why can the kernel support these 
and qemu can't support essentially the same thing?


That said, making an external program work well as a block backend is 
identical to making userspace block devices fast.


More or less, yes.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-26 Thread Avi Kivity

On 05/25/2010 08:12 PM, Sage Weil wrote:

On Tue, 25 May 2010, Avi Kivity wrote:
   

What's the reason for not having these drivers upstream? Do we gain
anything by hiding them from our users and requiring them to install the
drivers separately from somewhere else?

   

Six months.
 

FWIW, we (Ceph) aren't complaining about the 6 month lag time (and I don't
think the Sheepdog guys are either).

   


In that case (and if there are no other potential users), then there's 
no need for a plugin API.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/24/2010 10:38 PM, Anthony Liguori wrote:



- Building a plugin API seems a bit simpler to me, although I'm not
sure if I get the idea correctly:
   The block layer has already some kind of api (.bdrv_file_open, 
.bdrv_read). We
   could simply compile the block-drivers as shared objects and 
create a method

   for loading the necessary modules at runtime.


That approach would be a recipe for disaster.   We would have to 
introduce a new, reduced functionality block API that was supported 
for plugins.  Otherwise, the only way a plugin could keep up with our 
API changes would be if it was in tree which defeats the purpose of 
having plugins.


We could guarantee API/ABI stability in a stable branch but not across 
releases.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/24/2010 10:16 PM, Anthony Liguori wrote:

On 05/24/2010 06:56 AM, Avi Kivity wrote:

On 05/24/2010 02:42 PM, MORITA Kazutaka wrote:



The server would be local and talk over a unix domain socket, perhaps
anonymous.

nbd has other issues though, such as requiring a copy and no 
support for

metadata operations such as snapshot and file size extension.


Sorry, my explanation was unclear.  I'm not sure how running servers
on localhost can solve the problem.


The local server can convert from the local (nbd) protocol to the 
remote (sheepdog, ceph) protocol.



What I wanted to say was that we cannot specify the image of VM. With
nbd protocol, command line arguments are as follows:

  $ qemu nbd:hostname:port

As this syntax shows, with nbd protocol the client cannot pass the VM
image name to the server.


We would extend it to allow it to connect to a unix domain socket:

  qemu nbd:unix:/path/to/socket


nbd is a no-go because it only supports a single, synchronous I/O 
operation at a time and has no mechanism for extensibility.


If we go this route, I think two options are worth considering.  The 
first would be a purely socket based approach where we just accepted 
the extra copy.


The other potential approach would be shared memory based.  We export 
all guest ram as shared memory along with a small bounce buffer pool.  
We would then use a ring queue (potentially even using virtio-blk) and 
an eventfd for notification.
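
For reference, a minimal sketch of the notification half of such a scheme, assuming a hypothetical shared-memory ring whose producer kicks the consumer through an eventfd (the ring layout itself is omitted; none of these names are real qemu symbols):

#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

/* One counter, shared with the backend process via fd passing. */
int ring_notifier_create(void)
{
    return eventfd(0, EFD_CLOEXEC);
}

/* Producer: descriptors were just published into the ring, kick the peer. */
int ring_kick(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Consumer: block until at least one kick arrived, then drain the ring. */
int ring_wait(int efd)
{
    uint64_t count;
    return read(efd, &count, sizeof(count)) == sizeof(count) ? 0 : -1;
}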


We can't actually export guest memory unless we allocate it as a shared 
memory object, which has many disadvantages.  The only way to export 
anonymous memory now is vmsplice(), which is fairly limited.





The server at the other end would associate the socket with a 
filename and forward it to the server using the remote protocol.


However, I don't think nbd would be a good protocol.  My preference 
would be for a plugin API, or for a new local protocol that uses 
splice() to avoid copies.


I think a good shared memory implementation would be preferable to 
plugins.  I think it's worth attempting to do a plugin interface for 
the block layer but I strongly suspect it would not be sufficient.


I would not want to see plugins that interacted with BlockDriverState 
directly, for instance.  We change it far too often.  Our main loop 
functions are also not terribly stable so I'm not sure how we would 
handle that (unless we forced all block plugins to be in a separate 
thread).


If we manage to make a good long-term stable plugin API, it would be a 
good candidate for the block layer itself.


Some OSes manage to have a stable block driver ABI, so it should be 
possible, if difficult.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/24/2010 10:19 PM, Anthony Liguori wrote:

On 05/24/2010 06:03 AM, Avi Kivity wrote:

On 05/24/2010 11:27 AM, Stefan Hajnoczi wrote:

On Sun, May 23, 2010 at 1:01 PM, Avi Kivity a...@redhat.com wrote:

On 05/21/2010 12:29 AM, Anthony Liguori wrote:
I'd be more interested in enabling people to build these types of 
storage

systems without touching qemu.

Both sheepdog and ceph ultimately transmit I/O over a socket to a 
central

daemon, right?

That incurs an extra copy.

Besides a shared memory approach, I wonder if the splice() family of
syscalls could be used to send/receive data through a storage daemon
without the daemon looking at or copying the data?


Excellent idea.


splice() eventually requires a copy.  You cannot splice() to linux-aio 
so you'd have to splice() to a temporary buffer and then call into 
linux-aio.  With shared memory, you can avoid ever bringing the data 
into memory via O_DIRECT and linux-aio.
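
For reference, a minimal sketch of that O_DIRECT plus linux-aio path using libaio, with alignment and completion handling reduced to the bare minimum (illustrative only, not qemu's actual aio code; the function name is invented):

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Submit one O_DIRECT read through the kernel aio interface. */
int read_direct(const char *path, long long offset, size_t len)
{
    io_context_t ctx;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;
    int fd;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    if (posix_memalign(&buf, 512, len)) {   /* O_DIRECT needs aligned buffers */
        close(fd);
        return -1;
    }

    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(1, &ctx) == 0) {
        io_prep_pread(&cb, fd, buf, len, offset);
        if (io_submit(ctx, 1, cbs) == 1)            /* queue the request */
            io_getevents(ctx, 1, 1, &ev, NULL);     /* wait for completion */
        io_destroy(ctx);
    }
    free(buf);
    close(fd);
    return 0;
}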


If the final destination is a socket, then you end up queuing guest 
memory as an skbuff.  In theory we could do an aio splice to block 
devices but I don't think that's realistic given our experience with aio 
changes.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Kevin Wolf
On 23.05.2010 14:01, Avi Kivity wrote:
 On 05/21/2010 12:29 AM, Anthony Liguori wrote:

 I'd be more interested in enabling people to build these types of 
 storage systems without touching qemu.

 Both sheepdog and ceph ultimately transmit I/O over a socket to a 
 central daemon, right? 
 
 That incurs an extra copy.
 
 So could we not standardize a protocol for this that both sheepdog and 
 ceph could implement?
 
 The protocol already exists, nbd.  It doesn't support snapshotting etc. 
 but we could extend it.
 
 But IMO what's needed is a plugin API for the block layer.

What would it buy us, apart from more downstreams and having to maintain
a stable API and ABI? Hiding block drivers somewhere else doesn't make
them stop existing, they just might not be properly integrated, but
rather hacked in to fit that limited stable API.

Kevin


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 02:02 PM, Kevin Wolf wrote:





So could we not standardize a protocol for this that both sheepdog and
ceph could implement?
   

The protocol already exists, nbd.  It doesn't support snapshotting etc.
but we could extend it.

But IMO what's needed is a plugin API for the block layer.
 

What would it buy us, apart from more downstreams and having to maintain
a stable API and ABI?


Currently if someone wants to add a new block format, they have to 
upstream it and wait for a new qemu to be released.  With a plugin API, 
they can add a new block format to an existing, supported qemu.



Hiding block drivers somewhere else doesn't make
them stop existing, they just might not be properly integrated, but
rather hacked in to fit that limited stable API.
   


They would hack it to fit the current API, and hack the API in qemu.git 
to fit their requirements for the next release.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Christoph Hellwig
On Tue, May 25, 2010 at 02:25:53PM +0300, Avi Kivity wrote:
 Currently if someone wants to add a new block format, they have to  
 upstream it and wait for a new qemu to be released.  With a plugin API,  
 they can add a new block format to an existing, supported qemu.

So?  Unless we want a stable driver ABI (which I fundamentally oppose, as
it would make block driver development hell) they'd have to wait for
a new release of the block layer.  It's really just going to be a lot
of pain for no major gain.  qemu releases are frequent enough, and if
users care enough they can also easily patch qemu.



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 03:03 PM, Christoph Hellwig wrote:

On Tue, May 25, 2010 at 02:25:53PM +0300, Avi Kivity wrote:
   

Currently if someone wants to add a new block format, they have to
upstream it and wait for a new qemu to be released.  With a plugin API,
they can add a new block format to an existing, supported qemu.
 

So?  Unless we want a stable driver ABI (which I fundamentally oppose, as
it would make block driver development hell)


We'd only freeze it for a major release.


they'd have to wait for
a new release of the block layer.  It's really just going to be a lot
of pain for no major gain.  qemu releases are frequent enough, and if
users care enough they can also easily patch qemu.
   


May not be so easy for them, they lose binary updates from their distro 
and have to keep repatching.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 04:14 AM, Avi Kivity wrote:

On 05/24/2010 10:38 PM, Anthony Liguori wrote:



- Building a plugin API seems a bit simpler to me, although I'm not
sure if I get the idea correctly:
   The block layer has already some kind of api (.bdrv_file_open, 
.bdrv_read). We
   could simply compile the block-drivers as shared objects and 
create a method

   for loading the necessary modules at runtime.


That approach would be a recipe for disaster.   We would have to 
introduce a new, reduced functionality block API that was supported 
for plugins.  Otherwise, the only way a plugin could keep up with our 
API changes would be if it was in tree which defeats the purpose of 
having plugins.


We could guarantee API/ABI stability in a stable branch but not across 
releases.


We have releases every six months.  There would be tons of block plugins 
that didn't work for random sets of releases.  That creates a lot of 
user confusion and unhappiness.


Regards,

Anthony Liguori




Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 06:25 AM, Avi Kivity wrote:

On 05/25/2010 02:02 PM, Kevin Wolf wrote:





So could we not standardize a protocol for this that both sheepdog and
ceph could implement?

The protocol already exists, nbd.  It doesn't support snapshotting etc.
but we could extend it.

But IMO what's needed is a plugin API for the block layer.

What would it buy us, apart from more downstreams and having to maintain
a stable API and ABI?


Currently if someone wants to add a new block format, they have to 
upstream it and wait for a new qemu to be released.  With a plugin 
API, they can add a new block format to an existing, supported qemu.


Whether we have a plugin or protocol based mechanism to implement block 
formats really ends up being just an implementation detail.


In order to implement either, we need to take a subset of block 
functionality that we feel we can support long term and expose that.  
Right now, that's basically just querying characteristics (like size and 
geometry) and asynchronous reads and writes.
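
To make the scope concrete, a purely hypothetical sketch of what such a reduced interface might look like if it were frozen; every name below is invented for illustration and is not an existing qemu API:

#include <stdint.h>
#include <sys/uio.h>

typedef void (*blk_completion_fn)(void *opaque, int ret);

/* Hypothetical frozen backend interface: characteristics plus async I/O. */
struct blk_backend_ops {
    uint32_t version;                       /* interface version check */
    int      (*open)(void **state, const char *options);
    void     (*close)(void *state);
    int64_t  (*get_size)(void *state);      /* image size in bytes */
    int      (*get_geometry)(void *state, uint32_t *cyls,
                             uint32_t *heads, uint32_t *secs);
    int      (*aio_readv)(void *state, uint64_t offset,
                          struct iovec *iov, int iovcnt,
                          blk_completion_fn cb, void *opaque);
    int      (*aio_writev)(void *state, uint64_t offset,
                           struct iovec *iov, int iovcnt,
                           blk_completion_fn cb, void *opaque);
};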


A protocol based mechanism has the advantage of being more robust in the 
face of poorly written block backends so if it's possible to make it 
perform as well as a plugin, it's a preferable approach.


Plugins that just expose chunks of QEMU internal state directly (like 
BlockDriver) are a really bad idea IMHO.


Regards,

Anthony Liguori


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 04:17 PM, Anthony Liguori wrote:

On 05/25/2010 04:14 AM, Avi Kivity wrote:

On 05/24/2010 10:38 PM, Anthony Liguori wrote:



- Building a plugin API seems a bit simpler to me, although I'm not
sure if I get the idea correctly:
   The block layer has already some kind of api (.bdrv_file_open, 
.bdrv_read). We
   could simply compile the block-drivers as shared objects and 
create a method

   for loading the necessary modules at runtime.


That approach would be a recipe for disaster.   We would have to 
introduce a new, reduced functionality block API that was supported 
for plugins.  Otherwise, the only way a plugin could keep up with 
our API changes would be if it was in tree which defeats the purpose 
of having plugins.


We could guarantee API/ABI stability in a stable branch but not 
across releases.


We have releases every six months.  There would be tons of block 
plugins that didn't work for random sets of releases.  That creates a 
lot of user confusion and unhappiness.


The current situation is that those block format drivers only exist in 
qemu.git or as patches.  Surely that's even more unhappiness.


Confusion could be mitigated:

  $ qemu -module my-fancy-block-format-driver.so
  my-fancy-block-format-driver.so does not support this version of qemu 
(0.19.2).  Please contact my-fancy-block-format-driver-de...@example.org.
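
A sketch of how such a "-module" loader and version check could work, assuming a hypothetical descriptor exported by each shared object (nothing here exists in qemu today):

#include <dlfcn.h>
#include <stdio.h>

#define BLOCK_MODULE_IFACE_VERSION 1     /* invented for illustration */

struct block_module {                    /* invented descriptor layout */
    int version;
    const char *contact;
    int (*register_driver)(void);
};

static int load_block_module(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    struct block_module *mod;

    if (!handle) {
        fprintf(stderr, "%s: %s\n", path, dlerror());
        return -1;
    }
    mod = dlsym(handle, "block_module");
    if (!mod || mod->version != BLOCK_MODULE_IFACE_VERSION) {
        fprintf(stderr, "%s does not support this version of qemu. "
                "Please contact %s.\n",
                path, mod ? mod->contact : "the module author");
        dlclose(handle);
        return -1;
    }
    return mod->register_driver();       /* hook into the block layer */
}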


The question is how many such block format drivers we expect.  We now 
have two in the pipeline (ceph, sheepdog), it's reasonable to assume 
we'll want an lvm2 driver and btrfs driver.  This is an area with a lot 
of activity and a relatively simple interface.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread MORITA Kazutaka
At Mon, 24 May 2010 14:16:32 -0500,
Anthony Liguori wrote:
 
 On 05/24/2010 06:56 AM, Avi Kivity wrote:
  On 05/24/2010 02:42 PM, MORITA Kazutaka wrote:
 
  The server would be local and talk over a unix domain socket, perhaps
  anonymous.
 
  nbd has other issues though, such as requiring a copy and no support 
  for
  metadata operations such as snapshot and file size extension.
 
  Sorry, my explanation was unclear.  I'm not sure how running servers
  on localhost can solve the problem.
 
  The local server can convert from the local (nbd) protocol to the 
  remote (sheepdog, ceph) protocol.
 
  What I wanted to say was that we cannot specify the image of VM. With
  nbd protocol, command line arguments are as follows:
 
$ qemu nbd:hostname:port
 
  As this syntax shows, with nbd protocol the client cannot pass the VM
  image name to the server.
 
  We would extend it to allow it to connect to a unix domain socket:
 
qemu nbd:unix:/path/to/socket
 
 nbd is a no-go because it only supports a single, synchronous I/O 
 operation at a time and has no mechanism for extensibility.
 
 If we go this route, I think two options are worth considering.  The 
 first would be a purely socket based approach where we just accepted the 
 extra copy.
 
 The other potential approach would be shared memory based.  We export 
 all guest ram as shared memory along with a small bounce buffer pool.  
 We would then use a ring queue (potentially even using virtio-blk) and 
 an eventfd for notification.
 

The shared memory approach assumes that there is a local server who
can talk with the storage system.  But Ceph doesn't require the local
server, and Sheepdog would be extended to support VMs running outside
the storage system.  We could run a local daemon who can only work as
proxy, but I don't think it looks a clean approach.  So I think a
socket based approach is the right way to go.

BTW, is it required to design a common interface?  The way Sheepdog
replicates data is different from Ceph, so I think it is not possible
to define a common protocol as Christian says.

Regards,

Kazutaka

  The server at the other end would associate the socket with a filename 
  and forward it to the server using the remote protocol.
 
  However, I don't think nbd would be a good protocol.  My preference 
  would be for a plugin API, or for a new local protocol that uses 
  splice() to avoid copies.
 
 I think a good shared memory implementation would be preferable to 
 plugins.  I think it's worth attempting to do a plugin interface for the 
 block layer but I strongly suspect it would not be sufficient.
 
 I would not want to see plugins that interacted with BlockDriverState 
 directly, for instance.  We change it far too often.  Our main loop 
 functions are also not terribly stable so I'm not sure how we would 
 handle that (unless we forced all block plugins to be in a separate thread).
 


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 08:25 AM, Avi Kivity wrote:

On 05/25/2010 04:17 PM, Anthony Liguori wrote:

On 05/25/2010 04:14 AM, Avi Kivity wrote:

On 05/24/2010 10:38 PM, Anthony Liguori wrote:



- Building a plugin API seems a bit simpler to me, although I'm not
sure if I get the idea correctly:
   The block layer has already some kind of api (.bdrv_file_open, 
.bdrv_read). We
   could simply compile the block-drivers as shared objects and 
create a method

   for loading the necessary modules at runtime.


That approach would be a recipe for disaster.   We would have to 
introduce a new, reduced functionality block API that was supported 
for plugins.  Otherwise, the only way a plugin could keep up with 
our API changes would be if it was in tree which defeats the 
purpose of having plugins.


We could guarantee API/ABI stability in a stable branch but not 
across releases.


We have releases every six months.  There would be tons of block 
plugins that didn't work for random sets of releases.  That creates a 
lot of user confusion and unhappiness.


The current situation is that those block format drivers only exist in 
qemu.git or as patches.  Surely that's even more unhappiness.


Confusion could be mitigated:

  $ qemu -module my-fancy-block-format-driver.so
  my-fancy-block-format-driver.so does not support this version of 
qemu (0.19.2).  Please contact 
my-fancy-block-format-driver-de...@example.org.


The question is how many such block format drivers we expect.  We now 
have two in the pipeline (ceph, sheepdog), it's reasonable to assume 
we'll want an lvm2 driver and btrfs driver.  This is an area with a 
lot of activity and a relatively simple interface.


If we expose a simple interface, I'm all for it.  But BlockDriver is not 
simple and things like the snapshoting API need love.


Of course, there's certainly a question of why we're solving this in 
qemu at all.  Wouldn't it be more appropriate to either (1) implement a 
kernel module for ceph/sheepdog if performance matters or (2) implement 
BUSE to complement FUSE and CUSE to enable proper userspace block devices.


If you want to use a block device within qemu, you almost certainly want 
to be able to manipulate it on the host using standard tools (like mount 
and parted) so it stands to reason that addressing this in the kernel 
makes more sense.


Regards,

Anthony Liguori



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 04:25 PM, Anthony Liguori wrote:
Currently if someone wants to add a new block format, they have to 
upstream it and wait for a new qemu to be released.  With a plugin 
API, they can add a new block format to an existing, supported qemu.



Whether we have a plugin or protocol based mechanism to implement 
block formats really ends up being just an implementation detail.


True.

In order to implement either, we need to take a subset of block 
functionality that we feel we can support long term and expose that.  
Right now, that's basically just querying characteristics (like size 
and geometry) and asynchronous reads and writes.


Unfortunately, you're right.

A protocol based mechanism has the advantage of being more robust in 
the face of poorly written block backends so if it's possible to make 
it perform as well as a plugin, it's a preferable approach.


May be hard due to difficulty of exposing guest memory.



Plugins that just expose chunks of QEMU internal state directly (like 
BlockDriver) are a really bad idea IMHO.


Also, we don't want to expose all of the qemu API.  We should default 
the visibility attribute to hidden and expose only select functions, 
perhaps under their own interface.  And no inlines.
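
A minimal sketch of that convention with GCC's visibility attribute, assuming the tree is built with -fvisibility=hidden and only whitelisted entry points are exported (the names are invented for illustration):

/* Build with -fvisibility=hidden so symbols default to hidden, then mark
 * the few plugin-facing functions explicitly.  Names are hypothetical. */
#define PLUGIN_API __attribute__((visibility("default")))

static int blk_internal_helper(void)   /* stays invisible to plugins */
{
    return 0;
}

PLUGIN_API int plugin_register_block_driver(const void *ops)
{
    (void)ops;                         /* registration details omitted */
    return blk_internal_helper();
}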


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 04:35 PM, Anthony Liguori wrote:

On 05/25/2010 08:31 AM, Avi Kivity wrote:
A protocol based mechanism has the advantage of being more robust in 
the face of poorly written block backends so if it's possible to 
make it perform as well as a plugin, it's a preferable approach.


May be hard due to difficulty of exposing guest memory.


If someone did a series to add plugins, I would expect a very strong 
argument as to why a shared memory mechanism was not possible or at 
least plausible.


I'm not sure I understand why shared memory is such a bad thing wrt 
KVM.  Can you elaborate?  Is it simply a matter of fork()?


fork() doesn't work with memory hotplug.  What else is there?

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 08:31 AM, Avi Kivity wrote:
A protocol based mechanism has the advantage of being more robust in 
the face of poorly written block backends so if it's possible to make 
it perform as well as a plugin, it's a preferable approach.


May be hard due to difficulty of exposing guest memory.


If someone did a series to add plugins, I would expect a very strong 
argument as to why a shared memory mechanism was not possible or at 
least plausible.


I'm not sure I understand why shared memory is such a bad thing wrt 
KVM.  Can you elaborate?  Is it simply a matter of fork()?




Plugins that just expose chunks of QEMU internal state directly (like 
BlockDriver) are a really bad idea IMHO.


Also, we don't want to expose all of the qemu API.  We should default 
the visibility attribute to hidden and expose only select functions, 
perhaps under their own interface.  And no inlines.


Yeah, if we did plugins, this would be a key requirement.

Regards,

Anthony Liguori


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Kevin Wolf
On 25.05.2010 15:25, Anthony Liguori wrote:
 On 05/25/2010 06:25 AM, Avi Kivity wrote:
 On 05/25/2010 02:02 PM, Kevin Wolf wrote:


 So could we not standardize a protocol for this that both sheepdog and
 ceph could implement?
 The protocol already exists, nbd.  It doesn't support snapshotting etc.
 but we could extend it.

 But IMO what's needed is a plugin API for the block layer.
 What would it buy us, apart from more downstreams and having to maintain
 a stable API and ABI?

 Currently if someone wants to add a new block format, they have to 
 upstream it and wait for a new qemu to be released.  With a plugin 
 API, they can add a new block format to an existing, supported qemu.
 
 Whether we have a plugin or protocol based mechanism to implement block 
 formats really ends up being just an implementation detail.
 
 In order to implement either, we need to take a subset of block 
 functionality that we feel we can support long term and expose that.  
 Right now, that's basically just querying characteristics (like size and 
 geometry) and asynchronous reads and writes.
 
 A protocol based mechanism has the advantage of being more robust in the 
 face of poorly written block backends so if it's possible to make it 
 perform as well as a plugin, it's a preferable approach.
 
 Plugins that just expose chunks of QEMU internal state directly (like 
 BlockDriver) are a really bad idea IMHO.

I'm still not convinced that we need either. I share Christoph's concern
that we would make our life harder for almost no gain. It's probably a
very small group of users (if it exists at all) that wants to add new
block drivers themselves, but at the same time can't run upstream qemu.

But if we were to decide that there's no way around it, I agree with you
that directly exposing the internal API isn't going to work.

Kevin


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 08:36 AM, Avi Kivity wrote:


We'd need a kernel-level generic snapshot API for this eventually.

or (2) implement BUSE to complement FUSE and CUSE to enable proper 
userspace block devices.


Likely slow due to lots of copying.  Also needs a snapshot API.


The kernel could use splice.


(ABUSE was proposed a while ago by Zach).

If you want to use a block device within qemu, you almost certainly 
want to be able to manipulate it on the host using standard tools 
(like mount and parted) so it stands to reason that addressing this 
in the kernel makes more sense.


qemu-nbd also allows this.

This reasoning also applies to qcow2, btw.


I know.

Regards,

Anthony Liguori




Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 08:38 AM, Avi Kivity wrote:

On 05/25/2010 04:35 PM, Anthony Liguori wrote:

On 05/25/2010 08:31 AM, Avi Kivity wrote:
A protocol based mechanism has the advantage of being more robust 
in the face of poorly written block backends so if it's possible to 
make it perform as well as a plugin, it's a preferable approach.


May be hard due to difficulty of exposing guest memory.


If someone did a series to add plugins, I would expect a very strong 
argument as to why a shared memory mechanism was not possible or at 
least plausible.


I'm not sure I understand why shared memory is such a bad thing wrt 
KVM.  Can you elaborate?  Is it simply a matter of fork()?


fork() doesn't work with memory hotplug.  What else is there?



Is it that fork() doesn't work or is it that fork() is very expensive?

Regards,

Anthony Liguori


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 04:54 PM, Anthony Liguori wrote:

On 05/25/2010 08:36 AM, Avi Kivity wrote:


We'd need a kernel-level generic snapshot API for this eventually.

or (2) implement BUSE to complement FUSE and CUSE to enable proper 
userspace block devices.


Likely slow due to lots of copying.  Also needs a snapshot API.


The kernel could use splice.


Still can't make guest memory appear in (A)BUSE process memory without 
either mmu tricks (vmsplice in reverse) or a copy.  May be workable for 
an (A)BUSE driver that talks over a network, and thus can splice() its 
way out.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 04:55 PM, Anthony Liguori wrote:

On 05/25/2010 08:38 AM, Avi Kivity wrote:

On 05/25/2010 04:35 PM, Anthony Liguori wrote:

On 05/25/2010 08:31 AM, Avi Kivity wrote:
A protocol based mechanism has the advantage of being more robust 
in the face of poorly written block backends so if it's possible 
to make it perform as well as a plugin, it's a preferable approach.


May be hard due to difficulty of exposing guest memory.


If someone did a series to add plugins, I would expect a very strong 
argument as to why a shared memory mechanism was not possible or at 
least plausible.


I'm not sure I understand why shared memory is such a bad thing wrt 
KVM.  Can you elaborate?  Is it simply a matter of fork()?


fork() doesn't work with memory hotplug.  What else is there?



Is it that fork() doesn't work or is it that fork() is very expensive?


It doesn't work, fork() is done at block device creation time, which 
freezes the child memory map, while guest memory is allocated at hotplug 
time.


fork() actually isn't very expensive since we use MADV_DONTFORK 
(probably fast enough for everything except realtime).
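
For context, a minimal sketch of the MADV_DONTFORK usage being referred to: the region is marked so fork() does not duplicate its page tables into the child (illustrative allocation helper, not qemu's actual RAM setup):

#include <sys/mman.h>
#include <stddef.h>

void *alloc_guest_ram(size_t size)
{
    void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ram == MAP_FAILED)
        return NULL;
    /* children created by fork() will not inherit this mapping */
    if (madvise(ram, size, MADV_DONTFORK) < 0) {
        munmap(ram, size);
        return NULL;
    }
    return ram;
}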


It may be possible to do a processfd() which can be mmap()ed by another 
process to export anonymous memory using mmu notifiers, not sure how 
easy or mergeable that is.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Kevin Wolf
On 25.05.2010 15:25, Avi Kivity wrote:
 On 05/25/2010 04:17 PM, Anthony Liguori wrote:
 On 05/25/2010 04:14 AM, Avi Kivity wrote:
 On 05/24/2010 10:38 PM, Anthony Liguori wrote:

 - Building a plugin API seems a bit simpler to me, although I'm not
 sure if I get the idea correctly:
The block layer has already some kind of api (.bdrv_file_open, 
 .bdrv_read). We
could simply compile the block-drivers as shared objects and 
 create a method
for loading the necessary modules at runtime.

 That approach would be a recipe for disaster.   We would have to 
 introduce a new, reduced functionality block API that was supported 
 for plugins.  Otherwise, the only way a plugin could keep up with 
 our API changes would be if it was in tree which defeats the purpose 
 of having plugins.

 We could guarantee API/ABI stability in a stable branch but not 
 across releases.

 We have releases every six months.  There would be tons of block 
 plugins that didn't work for random sets of releases.  That creates a 
 lot of user confusion and unhappiness.
 
 The current situation is that those block format drivers only exist in 
 qemu.git or as patches.  Surely that's even more unhappiness.

The difference is that in the current situation these drivers will be
part of the next qemu release, so the patch may be obsolete, but you
don't even need it any more.

If you start keeping block drivers outside qemu and not even try
integrating them, they'll stay external.

 Confusion could be mitigated:
 
$ qemu -module my-fancy-block-format-driver.so
my-fancy-block-format-driver.so does not support this version of qemu 
 (0.19.2).  Please contact my-fancy-block-format-driver-de...@example.org.
 
 The question is how many such block format drivers we expect.  We now 
 have two in the pipeline (ceph, sheepdog), it's reasonable to assume 
 we'll want an lvm2 driver and btrfs driver.  This is an area with a lot 
 of activity and a relatively simple interface.

What's the reason for not having these drivers upstream? Do we gain
anything by hiding them from our users and requiring them to install the
drivers separately from somewhere else?

Kevin


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 08:57 AM, Avi Kivity wrote:

On 05/25/2010 04:54 PM, Anthony Liguori wrote:

On 05/25/2010 08:36 AM, Avi Kivity wrote:


We'd need a kernel-level generic snapshot API for this eventually.

or (2) implement BUSE to complement FUSE and CUSE to enable proper 
userspace block devices.


Likely slow due to lots of copying.  Also needs a snapshot API.


The kernel could use splice.


Still can't make guest memory appear in (A)BUSE process memory without 
either mmu tricks (vmsplice in reverse) or a copy.  May be workable 
for an (A)BUSE driver that talks over a network, and thus can splice() 
its way out.


splice() actually takes offset parameter so it may be possible to treat 
that offset parameter as a file offset.  That would essentially allow 
you to implement a splice() based thread pool where splice() replaces 
preadv/pwritev.


It's not quite linux-aio, but it should take you pretty far.   I think 
the main point is that the problem of allowing block plugins to qemu is 
the same as block plugins for the kernel.  The kernel doesn't provide a 
stable interface (and we probably can't for the same reasons) and it's 
generally discouraged from a code quality perspective.


That said, making an external program work well as a block backend is 
identical to making userspace block devices fast.


Regards,

Anthony Liguori



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 08:55 AM, Avi Kivity wrote:

On 05/25/2010 04:53 PM, Kevin Wolf wrote:


I'm still not convinced that we need either. I share Christoph's concern
that we would make our life harder for almost no gain. It's probably a
very small group of users (if it exists at all) that wants to add new
block drivers themselves, but at the same time can't run upstream qemu.



The first part of your argument may be true, but the second isn't.  No 
user can run upstream qemu.git.  It's not tested or supported, and has 
no backwards compatibility guarantees.


Yes, it does have backwards compatibility guarantees.

Regards,

Anthony Liguori



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 09:01 AM, Avi Kivity wrote:

On 05/25/2010 04:55 PM, Anthony Liguori wrote:

On 05/25/2010 08:38 AM, Avi Kivity wrote:

On 05/25/2010 04:35 PM, Anthony Liguori wrote:

On 05/25/2010 08:31 AM, Avi Kivity wrote:
A protocol based mechanism has the advantage of being more robust 
in the face of poorly written block backends so if it's possible 
to make it perform as well as a plugin, it's a preferable approach.


May be hard due to difficulty of exposing guest memory.


If someone did a series to add plugins, I would expect a very 
strong argument as to why a shared memory mechanism was not 
possible or at least plausible.


I'm not sure I understand why shared memory is such a bad thing wrt 
KVM.  Can you elaborate?  Is it simply a matter of fork()?


fork() doesn't work with memory hotplug.  What else is there?



Is it that fork() doesn't work or is it that fork() is very expensive?


It doesn't work, fork() is done at block device creation time, which 
freezes the child memory map, while guest memory is allocated at 
hotplug time.


Now I'm confused.  I thought you were saying shared memory somehow 
affects fork().  If you're talking about shared memory inheritance via 
fork(), that's less important.  You can also pass /dev/shm fd's via 
SCM_RIGHTs to establish shared memory segments dynamically.
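
For reference, a minimal sketch of handing such an fd (e.g. one backed by /dev/shm) to another process over an AF_UNIX socket with SCM_RIGHTS (generic boilerplate, not qemu code):

#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

int send_fd(int sock, int fd_to_pass)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    /* the receiver gets a duplicate of fd_to_pass it can mmap() itself */
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}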


Regards,

Anthony Liguori

fork() actually isn't very expensive since we use MADV_DONTFORK 
(probably fast enough for everything except realtime).


It may be possible to do a processfd() which can be mmap()ed by 
another process to export anonymous memory using mmu notifiers, not 
sure how easy or mergeable that is.






Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Kevin Wolf
On 25.05.2010 15:55, Avi Kivity wrote:
 On 05/25/2010 04:53 PM, Kevin Wolf wrote:

 I'm still not convinced that we need either. I share Christoph's concern
 that we would make our life harder for almost no gain. It's probably a
 very small group of users (if it exists at all) that wants to add new
 block drivers themselves, but at the same time can't run upstream qemu.


 
 The first part of your argument may be true, but the second isn't.  No 
 user can run upstream qemu.git.  It's not tested or supported, and has 
 no backwards compatibility guarantees.

The second part was basically meant to say developers don't count here.

Kevin


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 05:05 PM, Anthony Liguori wrote:

On 05/25/2010 09:01 AM, Avi Kivity wrote:

On 05/25/2010 04:55 PM, Anthony Liguori wrote:

On 05/25/2010 08:38 AM, Avi Kivity wrote:

On 05/25/2010 04:35 PM, Anthony Liguori wrote:

On 05/25/2010 08:31 AM, Avi Kivity wrote:
A protocol based mechanism has the advantage of being more 
robust in the face of poorly written block backends so if it's 
possible to make it perform as well as a plugin, it's a 
preferable approach.


May be hard due to difficulty of exposing guest memory.


If someone did a series to add plugins, I would expect a very 
strong argument as to why a shared memory mechanism was not 
possible or at least plausible.


I'm not sure I understand why shared memory is such a bad thing 
wrt KVM.  Can you elaborate?  Is it simply a matter of fork()?


fork() doesn't work with memory hotplug.  What else is
there?




Is it that fork() doesn't work or is it that fork() is very expensive?


It doesn't work, fork() is done at block device creation time, which 
freezes the child memory map, while guest memory is allocated at 
hotplug time.


Now I'm confused.  I thought you were saying shared memory somehow 
affects fork().  If you're talking about shared memory inheritance via 
fork(), that's less important. 


The latter.  Why is it less important?  If you don't inherit the memory, 
you can't access it.


You can also pass /dev/shm fd's via SCM_RIGHTs to establish shared 
memory segments dynamically.


Doesn't work for anonymous memory.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 05:09 PM, Kevin Wolf wrote:



The first part of your argument may be true, but the second isn't.  No
user can run upstream qemu.git.  It's not tested or supported, and has
no backwards compatibility guarantees.
 

The second part was basically meant to say developers don't count here.
   


Agreed.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 10:00 AM, Avi Kivity wrote:
The latter.  Why is it less important?  If you don't inherit the 
memory, you can't access it.


You can also pass /dev/shm fd's via SCM_RIGHTs to establish shared 
memory segments dynamically.


Doesn't work for anonymous memory.


What's wrong with /dev/shm memory?

Regards,

Anthony Liguori




Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 06:01 PM, Anthony Liguori wrote:

On 05/25/2010 10:00 AM, Avi Kivity wrote:
The latter.  Why is it less important?  If you don't inherit the 
memory, you can't access it.


You can also pass /dev/shm fd's via SCM_RIGHTs to establish shared 
memory segments dynamically.


Doesn't work for anonymous memory.


What's wrong with /dev/shm memory?


The kernel treats anonymous and non-anonymous memory differently for swapping 
(see /proc/sys/vm/swappiness); transparent hugepages won't work for 
/dev/shm (though it may be argued that that's a problem with thp); setup 
(/dev/shm defaults to half memory IIRC, we want mem+swap); different 
cgroup handling; somewhat clunky (a minor concern to be sure).


Nothing is a killer, but we should prefer anonymous memory.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Anthony Liguori

On 05/25/2010 11:16 AM, Avi Kivity wrote:

On 05/25/2010 06:01 PM, Anthony Liguori wrote:

On 05/25/2010 10:00 AM, Avi Kivity wrote:
The latter.  Why is it less important?  If you don't inherit the 
memory, you can't access it.


You can also pass /dev/shm fd's via SCM_RIGHTs to establish shared 
memory segments dynamically.


Doesn't work for anonymous memory.


What's wrong with /dev/shm memory?


The kernel treats anonymous and non-anonymous memory differently for 
swapping (see /proc/sys/vm/swappiness); transparent hugepages won't 
work for /dev/shm (though it may be argued that that's a problem with 
thp); setup (/dev/shm defaults to half memory IIRC, we want mem+swap); 
different cgroup handling; somewhat clunky (a minor concern to be sure).


Surely, with mmu notifiers, it wouldn't be that hard to share anonymous 
memory via an fd though, no?


Regards,

Anthony Liguori



Nothing is a killer, but we should prefer anonymous memory.





Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 05:01 PM, Kevin Wolf wrote:



The current situation is that those block format drivers only exist in
qemu.git or as patches.  Surely that's even more unhappiness.
 

The difference is that in the current situation these drivers will be
part of the next qemu release, so the patch may be obsolete, but you
don't even need it any more.
   


The next qemu release may be six months in the future.  So if you're not 
happy with running qemu.git master or with patching a stable release, 
you have to wait.



If you start keeping block drivers outside qemu and not even try
integrating them, they'll stay external.
   


Which may or may not be a problem.


Confusion could be mitigated:

$ qemu -module my-fancy-block-format-driver.so
my-fancy-block-format-driver.so does not support this version of qemu
(0.19.2).  Please contact my-fancy-block-format-driver-de...@example.org.

The question is how many such block format drivers we expect.  We now
have two in the pipeline (ceph, sheepdog), it's reasonable to assume
we'll want an lvm2 driver and btrfs driver.  This is an area with a lot
of activity and a relatively simple interface.
 

What's the reason for not having these drivers upstream? Do we gain
anything by hiding them from our users and requiring them to install the
drivers separately from somewhere else?
   


Six months.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Avi Kivity

On 05/25/2010 07:21 PM, Anthony Liguori wrote:

On 05/25/2010 11:16 AM, Avi Kivity wrote:

On 05/25/2010 06:01 PM, Anthony Liguori wrote:

On 05/25/2010 10:00 AM, Avi Kivity wrote:
The latter.  Why is it less important?  If you don't inherit the 
memory, you can't access it.


You can also pass /dev/shm fd's via SCM_RIGHTs to establish shared 
memory segments dynamically.


Doesn't work for anonymous memory.


What's wrong with /dev/shm memory?


The kernel treats anonymous and nonymous memory differently for 
swapping (see /proc/sys/vm/swappiness); transparent hugepages won't 
work for /dev/shm (though it may be argued that that's a problem with 
thp); setup (/dev/shm defaults to half memory IIRC, we want 
mem+swap); different cgroup handling; somewhat clunky (a minor 
concern to be sure).


Surely, with mmu notifiers, it wouldn't be that hard to share 
anonymous memory via an fd though, no?


That's what I suggested with processfd().  I wouldn't call it easy but 
it's likely doable.  Whether it's mergable is a different issue.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Sage Weil
On Tue, 25 May 2010, Avi Kivity wrote:
  What's the reason for not having these drivers upstream? Do we gain
  anything by hiding them from our users and requiring them to install the
  drivers separately from somewhere else?
 
 
 Six months.

FWIW, we (Ceph) aren't complaining about the 6 month lag time (and I don't 
think the Sheepdog guys are either).

From our perspective, the current BlockDriver abstraction is ideal, as it 
represents the reality of qemu's interaction with storage.  Any 'external' 
interface will be inferior to that in one way or another.  But either way, 
we are perfectly willing to work with you all to keep in sync with any 
future BlockDriver API improvements.  It is worth our time investment even 
if the API is less stable.

The ability to dynamically load a shared object using the existing api 
would make development a bit easier, but I'm not convinced it's better 
for users.  I think having ceph and sheepdog upstream with qemu will serve 
end users best, and we at least are willing to spend the time to help 
maintain that code in qemu.git.

sage


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Blue Swirl
On Mon, May 24, 2010 at 2:17 AM, Yehuda Sadeh Weinraub
yehud...@gmail.com wrote:
 On Sun, May 23, 2010 at 12:59 AM, Blue Swirl blauwir...@gmail.com wrote:
 On Thu, May 20, 2010 at 11:02 PM, Yehuda Sadeh Weinraub
 yehud...@gmail.com wrote:
 On Thu, May 20, 2010 at 1:31 PM, Blue Swirl blauwir...@gmail.com wrote:
 On Wed, May 19, 2010 at 7:22 PM, Christian Brunner c...@muc.de wrote:
 The attached patch is a block driver for the distributed file system
 Ceph (http://ceph.newdream.net/). This driver uses librados (which
 is part of the Ceph server) for direct access to the Ceph object
 store and is running entirely in userspace. Therefore it is
 called rbd - rados block device.
 ...

 IIRC underscores here may conflict with system header use. Please use
 something like QEMU_BLOCK_RADOS_H.

 This header is shared between the linux kernel client and the ceph
 userspace servers and client. We can actually get rid of it, as we
 only need it to define CEPH_OSD_TMAP_SET. We can move this definition
 to librados.h.

 diff --git a/block/rbd_types.h b/block/rbd_types.h
 new file mode 100644
 index 000..dfd5aa0
 --- /dev/null
 +++ b/block/rbd_types.h
 @@ -0,0 +1,48 @@
 +#ifndef _FS_CEPH_RBD
 +#define _FS_CEPH_RBD

 QEMU_BLOCK_RBD?

 This header is shared between the ceph kernel client, between the qemu
 rbd module (and between other ceph utilities). It'd be much easier
 maintaining it without having to have a different implementation for
 each. The same goes to the use of __le32/64 and __u32/64 within these
 headers.

 This is user space, so identifiers must conform to C standards. The
 identifiers beginning with underscores are reserved.

 Doesn't __le32/64 also depend on some GCC extension? Or sparse magic?
 It depends on gcc extension. If needed we can probably have a separate
 header for the qemu block device that uses alternative types. Though
 looking at the qemu code I see use of other gcc extensions so I'm not
 sure this is a real issue.

We use some (contained with, for example, macros where possible), but in
earlier discussions, __le32 etc. were considered problematic. IIRC
it's hard to provide alternate versions for other compilers (or older
versions of gcc).





 +
 +#include <linux/types.h>

 Can you use standard includes, like sys/types.h or inttypes.h? Are
 Ceph libraries used in other systems than Linux?

 Not at the moment. I guess that we can take this include out.


 +
 +/*
 + * rbd image 'foo' consists of objects
 + *   foo.rbd      - image metadata
 + *   foo.
 + *   foo.0001
 + *   ...          - data
 + */
 +
 +#define RBD_SUFFIX             ".rbd"
 +#define RBD_DIRECTORY          "rbd_directory"
 +
 +#define RBD_DEFAULT_OBJ_ORDER  22   /* 4MB */
 +
 +#define RBD_MAX_OBJ_NAME_SIZE  96
 +#define RBD_MAX_SEG_NAME_SIZE  128
 +
 +#define RBD_COMP_NONE          0
 +#define RBD_CRYPT_NONE         0
 +
 +static const char rbd_text[] = "<<< Rados Block Device Image >>>\n";
 +static const char rbd_signature[] = "RBD";
 +static const char rbd_version[] = "001.001";
 +
 +struct rbd_obj_snap_ondisk {
 +       __le64 id;
 +       __le64 image_size;
 +} __attribute__((packed));
 +
 +struct rbd_obj_header_ondisk {
 +       char text[64];
 +       char signature[4];
 +       char version[8];
 +       __le64 image_size;

 Unaligned? Is the disk format fixed?

 This is a packed structure that represents the on disk format.
 Operations on it are being done only to read from the disk header or
 to write to the disk header.

 That's clear. But what exactly is the alignment of field 'image_size'?
 Could there be implicit padding to mod 8 between 'version' and
 'image_size' with some compilers?

 Obviously it's not 64 bit aligned. As it's an on-disk header, I don't
 see alignment a real issue. As was said before, any operation on these
 fields have to go through endianity conversion anyway, and this
 structure should not be used directly. For such datastructures I'd
 rather have the fields ordered in some logical order than maintaining
 the alignment by ourselves. That's why we have that __attribute__
 packed in the end to let the compiler deal with those issues. Other
 compilers though have their own syntax for packed structures (but I do
 see other uses of this packed syntax in the qemu code).

Packed structures are OK, but the padding should be explicit to avoid
compiler problems.

Eventually the disk format is read into a memory buffer, and aligned
fields should also be faster on all architectures, even on x86.


 If there were no other constraints, I'd either make the padding
 explicit, or rearrange/resize fields so that the field alignment is
 natural. Thus my question, can you change the disk format or are there
 already some deployments?

 We can certainly make changes to the disk format at this point. I'm
 not very happy with those 3 __u8 in the middle, and they can probably
 be changed to a 32 bit flags field. We can get it 64 bit aligned too.
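
For illustration only, one shape such a header could take so that image_size
ends up naturally aligned; the flags field and layout here are hypothetical,
not the agreed-on format:

#include <stdint.h>

/* Sketch: fold the small fields into an explicit 32-bit flags word so that
 * every member is naturally aligned and the packed attribute no longer
 * implies unaligned 64-bit accesses.  Purely illustrative. */
struct rbd_obj_header_ondisk_sketch {
    char     text[64];
    char     signature[4];
    char     version[8];
    uint32_t flags;        /* would replace the three __u8 fields plus a pad byte */
    uint64_t image_size;   /* offset 80, a multiple of 8 */
    /* ... snapshot fields would follow ... */
} __attribute__((packed));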

I hope my comments helped you to avoid possible problems in the future.

Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread MORITA Kazutaka
At Tue, 25 May 2010 10:12:53 -0700 (PDT),
Sage Weil wrote:
 
 On Tue, 25 May 2010, Avi Kivity wrote:
   What's the reason for not having these drivers upstream? Do we gain
   anything by hiding them from our users and requiring them to install the
   drivers separately from somewhere else?
  
  
  Six months.
 
 FWIW, we (Ceph) aren't complaining about the 6 month lag time (and I don't 
 think the Sheepdog guys are either).
 
I agree.  We aren't complaining about it.

 From our perspective, the current BlockDriver abstraction is ideal, as it 
 represents the reality of qemu's interaction with storage.  Any 'external' 
 interface will be inferior to that in one way or another.  But either way, 
 we are perfectly willing to work with you all to keep in sync with any 
 future BlockDriver API improvements.  It is worth our time investment even 
 if the API is less stable.
 
I agree.

 The ability to dynamically load a shared object using the existing api 
 would make development a bit easier, but I'm not convinced it's better 
 for users.  I think having ceph and sheepdog upstream with qemu will serve 
 end users best, and we at least are willing to spend the time to help 
 maintain that code in qemu.git.
 
I agree.

Regards,

Kazutaka


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread MORITA Kazutaka
At Sun, 23 May 2010 15:01:59 +0300,
Avi Kivity wrote:
 
 On 05/21/2010 12:29 AM, Anthony Liguori wrote:
 
  I'd be more interested in enabling people to build these types of 
  storage systems without touching qemu.
 
  Both sheepdog and ceph ultimately transmit I/O over a socket to a 
  central daemon, right? 
 
 That incurs an extra copy.
 
  So could we not standardize a protocol for this that both sheepdog and 
  ceph could implement?
 
 The protocol already exists, nbd.  It doesn't support snapshotting etc. 
 but we could extend it.
 

I have no objection to using another protocol for Sheepdog support, but
I think the nbd protocol is unsuitable for a large storage pool with
many VM images.  This is because the nbd protocol doesn't support
specifying a file name to open.  If we use nbd with such a storage
system, the server needs to listen on as many ports as there are VM
images.  As far as I can see from the protocol, it looks difficult to
extend it without breaking backward compatibility.

Regards,

Kazutaka

 But IMO what's needed is a plugin API for the block layer.
 


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread Stefan Hajnoczi
On Sun, May 23, 2010 at 1:01 PM, Avi Kivity a...@redhat.com wrote:
 On 05/21/2010 12:29 AM, Anthony Liguori wrote:

 I'd be more interested in enabling people to build these types of storage
 systems without touching qemu.

 Both sheepdog and ceph ultimately transmit I/O over a socket to a central
 daemon, right?

 That incurs an extra copy.

Besides a shared memory approach, I wonder if the splice() family of
syscalls could be used to send/receive data through a storage daemon
without the daemon looking at or copying the data?
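
A rough sketch of the idea, assuming a daemon that proxies write requests
(the function name and error handling are illustrative only, not qemu or
Ceph/Sheepdog code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Forward 'len' payload bytes arriving on sock_fd to dst_fd at 'offset'
 * through a pipe, so the daemon never copies the data into its own buffers. */
static int forward_write(int sock_fd, int dst_fd, loff_t offset, size_t len)
{
    int pipefd[2];
    size_t done = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while (done < len) {
        /* socket -> pipe: the payload stays in kernel pages */
        ssize_t n = splice(sock_fd, NULL, pipefd[1], NULL, len - done,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0)
            break;
        /* pipe -> backing file at the requested offset
         * (off_out would have to be NULL if dst_fd were a socket) */
        while (n > 0) {
            ssize_t m = splice(pipefd[0], NULL, dst_fd, &offset, n,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0) {
                n = -1;
                break;
            }
            n -= m;
            done += m;
        }
        if (n < 0)
            break;
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return done == len ? 0 : -1;
}

The read path would splice() the other way round; whether the kernel can
really avoid copies end to end depends on the kinds of fds involved.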

Stefan


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread Avi Kivity

On 05/24/2010 11:27 AM, Stefan Hajnoczi wrote:

On Sun, May 23, 2010 at 1:01 PM, Avi Kivitya...@redhat.com  wrote:
   

On 05/21/2010 12:29 AM, Anthony Liguori wrote:
 

I'd be more interested in enabling people to build these types of storage
systems without touching qemu.

Both sheepdog and ceph ultimately transmit I/O over a socket to a central
daemon, right?
   

That incurs an extra copy.
 

Besides a shared memory approach, I wonder if the splice() family of
syscalls could be used to send/receive data through a storage daemon
without the daemon looking at or copying the data?
   


Excellent idea.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread Avi Kivity

On 05/24/2010 10:12 AM, MORITA Kazutaka wrote:

At Sun, 23 May 2010 15:01:59 +0300,
Avi Kivity wrote:
   

On 05/21/2010 12:29 AM, Anthony Liguori wrote:
 

I'd be more interested in enabling people to build these types of
storage systems without touching qemu.

Both sheepdog and ceph ultimately transmit I/O over a socket to a
central daemon, right?
   

That incurs an extra copy.

 

So could we not standardize a protocol for this that both sheepdog and
ceph could implement?
   

The protocol already exists, nbd.  It doesn't support snapshotting etc.
but we could extend it.

 

I have no objection to using another protocol for Sheepdog support, but
I think the nbd protocol is unsuitable for a large storage pool with
many VM images.  This is because the nbd protocol doesn't support
specifying a file name to open.  If we use nbd with such a storage
system, the server needs to listen on as many ports as there are VM
images.  As far as I can see from the protocol, it looks difficult to
extend it without breaking backward compatibility.
   


The server would be local and talk over a unix domain socket, perhaps 
anonymous.


nbd has other issues though, such as requiring a copy and no support for 
metadata operations such as snapshot and file size extension.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread MORITA Kazutaka
At Mon, 24 May 2010 14:05:58 +0300,
Avi Kivity wrote:
 
 On 05/24/2010 10:12 AM, MORITA Kazutaka wrote:
  At Sun, 23 May 2010 15:01:59 +0300,
  Avi Kivity wrote:
 
  On 05/21/2010 12:29 AM, Anthony Liguori wrote:
   
  I'd be more interested in enabling people to build these types of
  storage systems without touching qemu.
 
  Both sheepdog and ceph ultimately transmit I/O over a socket to a
  central daemon, right?
 
  That incurs an extra copy.
 
   
  So could we not standardize a protocol for this that both sheepdog and
  ceph could implement?
 
  The protocol already exists, nbd.  It doesn't support snapshotting etc.
  but we could extend it.
 
   
  I have no objection to using another protocol for Sheepdog support, but
  I think the nbd protocol is unsuitable for a large storage pool with
  many VM images.  This is because the nbd protocol doesn't support
  specifying a file name to open.  If we use nbd with such a storage
  system, the server needs to listen on as many ports as there are VM
  images.  As far as I can see from the protocol, it looks difficult to
  extend it without breaking backward compatibility.
 
 
 The server would be local and talk over a unix domain socket, perhaps 
 anonymous.
 
 nbd has other issues though, such as requiring a copy and no support for 
 metadata operations such as snapshot and file size extension.
 

Sorry, my explanation was unclear.  I'm not sure how running servers
on localhost can solve the problem.

What I wanted to say was that we cannot specify the image of VM. With
nbd protocol, command line arguments are as follows:

 $ qemu nbd:hostname:port

As this syntax shows, with nbd protocol the client cannot pass the VM
image name to the server.

Regards,

Kazutaka


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread Avi Kivity

On 05/24/2010 02:42 PM, MORITA Kazutaka wrote:



The server would be local and talk over a unix domain socket, perhaps
anonymous.

nbd has other issues though, such as requiring a copy and no support for
metadata operations such as snapshot and file size extension.

 

Sorry, my explanation was unclear.  I'm not sure how running servers
on localhost can solve the problem.
   


The local server can convert from the local (nbd) protocol to the remote 
(sheepdog, ceph) protocol.



What I wanted to say was that we cannot specify the image of VM. With
nbd protocol, command line arguments are as follows:

  $ qemu nbd:hostname:port

As this syntax shows, with nbd protocol the client cannot pass the VM
image name to the server.
   


We would extend it to allow it to connect to a unix domain socket:

  qemu nbd:unix:/path/to/socket

The server at the other end would associate the socket with a filename 
and forward it to the server using the remote protocol.
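
A minimal sketch of that association, assuming the local server creates one
socket per image under a made-up path such as /var/run/sheepdog/<image>.sock
(names are illustrative, not an existing interface):

#include <sys/socket.h>
#include <sys/un.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

/* Listen on a UNIX socket whose path identifies the image; the caller
 * accept()s connections and proxies nbd requests to the cluster protocol. */
int listen_for_image(const char *sock_path)
{
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
    unlink(sock_path);                      /* remove a stale socket file */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    printf("image mapped to %s\n", sock_path);
    return fd;
}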


However, I don't think nbd would be a good protocol.  My preference 
would be for a plugin API, or for a new local protocol that uses 
splice() to avoid copies.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread Cláudio Martins

On Mon, 24 May 2010 14:56:29 +0300 Avi Kivity a...@redhat.com wrote:
 On 05/24/2010 02:42 PM, MORITA Kazutaka wrote:
 
  The server would be local and talk over a unix domain socket, perhaps
  anonymous.
 
  nbd has other issues though, such as requiring a copy and no support for
  metadata operations such as snapshot and file size extension.
 
   
  Sorry, my explanation was unclear.  I'm not sure how running servers
  on localhost can solve the problem.
 
 
 The local server can convert from the local (nbd) protocol to the remote 
 (sheepdog, ceph) protocol.
 

 Please note that this shouldn't be relevant to the block driver based
on ceph, as it does not use a local daemon -- it connects to the Object
Storage Devices directly over the network.

 Best regards

Cláudio



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread MORITA Kazutaka
At Mon, 24 May 2010 14:56:29 +0300,
Avi Kivity wrote:
 
 On 05/24/2010 02:42 PM, MORITA Kazutaka wrote:
 
  The server would be local and talk over a unix domain socket, perhaps
  anonymous.
 
  nbd has other issues though, such as requiring a copy and no support for
  metadata operations such as snapshot and file size extension.
 
   
  Sorry, my explanation was unclear.  I'm not sure how running servers
  on localhost can solve the problem.
 
 
 The local server can convert from the local (nbd) protocol to the remote 
 (sheepdog, ceph) protocol.
 
  What I wanted to say was that we cannot specify the image of VM. With
  nbd protocol, command line arguments are as follows:
 
$ qemu nbd:hostname:port
 
  As this syntax shows, with nbd protocol the client cannot pass the VM
  image name to the server.
 
 
 We would extend it to allow it to connect to a unix domain socket:
 
qemu nbd:unix:/path/to/socket
 
 The server at the other end would associate the socket with a filename 
 and forward it to the server using the remote protocol.
 

Thank you for the explanation.  Sheepdog could achieve the desired
behavior by creating socket files for all the VM images when the
daemon starts up.

 However, I don't think nbd would be a good protocol.  My preference 
 would be for a plugin API, or for a new local protocol that uses 
 splice() to avoid copies.
 

Both would be okay for Sheepdog.  I want to take a suitable approach
for qemu.

Thanks,

Kazutaka


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread Christian Brunner
2010/5/24 MORITA Kazutaka morita.kazut...@lab.ntt.co.jp:

 However, I don't think nbd would be a good protocol.  My preference
 would be for a plugin API, or for a new local protocol that uses
 splice() to avoid copies.


 Both would be okay for Sheepdog.  I want to take a suitable approach
 for qemu.

I think both should be possible:

- Using splice() we would need a daemon that is listening on a control
  socket for requests from qemu-processes or admin commands. When a
  qemu-process wants to open an image, it could call
  open_image(protocol:imagename) on the control socket and the daemon
  has to create a pipe to which the image is mapped.
  (What I'm unsure about are the security implications. Do we need some
  kind of authentication for the sockets? What about sVirt?)

- Building a plugin API seems a bit simpler to me, although I'm not sure
  if I get the idea correctly:
  The block layer already has some kind of API (.bdrv_file_open,
  .bdrv_read). We could simply compile the block drivers as shared
  objects and create a method for loading the necessary modules at
  runtime (a rough sketch of such a loader follows below).
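
For what it's worth, a rough sketch of what such runtime loading could look
like; qemu_block_load_module(), the "qemu_block_driver" entry point and
"block-rbd.so" are made-up names for illustration, only bdrv_register()
exists in the current block layer:

#include <dlfcn.h>
#include <stdio.h>

typedef struct BlockDriver BlockDriver;         /* opaque here */
extern void bdrv_register(BlockDriver *bdrv);   /* existing block layer call */

/* Load e.g. "block-rbd.so" and register the driver it exports. */
static int qemu_block_load_module(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    BlockDriver *(*get_driver)(void);

    if (!handle) {
        fprintf(stderr, "%s: %s\n", path, dlerror());
        return -1;
    }
    /* every module would export one well-known entry point */
    get_driver = (BlockDriver *(*)(void))dlsym(handle, "qemu_block_driver");
    if (!get_driver) {
        fprintf(stderr, "%s: no qemu_block_driver symbol\n", path);
        dlclose(handle);
        return -1;
    }
    bdrv_register(get_driver());
    return 0;
}

The loading itself is the easy part; the open question is how stable the
BlockDriver and aio interfaces could be kept for out-of-tree modules.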

Are you planning to use this for all block drivers?

Regards,
Christian


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread Anthony Liguori

On 05/24/2010 06:56 AM, Avi Kivity wrote:

On 05/24/2010 02:42 PM, MORITA Kazutaka wrote:



The server would be local and talk over a unix domain socket, perhaps
anonymous.

nbd has other issues though, such as requiring a copy and no support for
metadata operations such as snapshot and file size extension.


Sorry, my explanation was unclear.  I'm not sure how running servers
on localhost can solve the problem.


The local server can convert from the local (nbd) protocol to the 
remote (sheepdog, ceph) protocol.



What I wanted to say was that we cannot specify the image of VM. With
nbd protocol, command line arguments are as follows:

  $ qemu nbd:hostname:port

As this syntax shows, with nbd protocol the client cannot pass the VM
image name to the server.


We would extend it to allow it to connect to a unix domain socket:

  qemu nbd:unix:/path/to/socket


nbd is a no-go because it only supports a single, synchronous I/O 
operation at a time and has no mechanism for extensibility.


If we go this route, I think two options are worth considering.  The 
first would be a purely socket based approach where we just accepted the 
extra copy.


The other potential approach would be shared memory based.  We export 
all guest ram as shared memory along with a small bounce buffer pool.  
We would then use a ring queue (potentially even using virtio-blk) and 
an eventfd for notification.
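
A minimal sketch of the notification half of that, using a plain eventfd as
the kick mechanism (the shared ring itself is omitted; this is illustrative,
not qemu code):

#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    int efd = eventfd(0, 0);
    uint64_t val = 1;

    if (efd < 0)
        return 1;
    /* producer: after placing a request in the shared ring, kick the peer */
    if (write(efd, &val, sizeof(val)) != sizeof(val))
        return 1;
    /* consumer: blocks until at least one kick has arrived, then drains */
    if (read(efd, &val, sizeof(val)) == sizeof(val))
        printf("%llu kick(s) pending, process the ring\n",
               (unsigned long long)val);
    close(efd);
    return 0;
}

In practice the eventfd (and the shared-memory fd) would be handed to the
external process over a UNIX socket with SCM_RIGHTS.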


The server at the other end would associate the socket with a filename 
and forward it to the server using the remote protocol.


However, I don't think nbd would be a good protocol.  My preference 
would be for a plugin API, or for a new local protocol that uses 
splice() to avoid copies.


I think a good shared memory implementation would be preferable to 
plugins.  I think it's worth attempting to do a plugin interface for the 
block layer but I strongly suspect it would not be sufficient.


I would not want to see plugins that interacted with BlockDriverState 
directly, for instance.  We change it far too often.  Our main loop 
functions are also not terribly stable so I'm not sure how we would 
handle that (unless we forced all block plugins to be in a separate thread).


Regards,

Anthony Liguori


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread Anthony Liguori

On 05/24/2010 06:03 AM, Avi Kivity wrote:

On 05/24/2010 11:27 AM, Stefan Hajnoczi wrote:

On Sun, May 23, 2010 at 1:01 PM, Avi Kivitya...@redhat.com  wrote:

On 05/21/2010 12:29 AM, Anthony Liguori wrote:
I'd be more interested in enabling people to build these types of
storage systems without touching qemu.

Both sheepdog and ceph ultimately transmit I/O over a socket to a
central daemon, right?

That incurs an extra copy.

Besides a shared memory approach, I wonder if the splice() family of
syscalls could be used to send/receive data through a storage daemon
without the daemon looking at or copying the data?


Excellent idea.


splice() eventually requires a copy.  You cannot splice() to linux-aio 
so you'd have to splice() to a temporary buffer and then call into 
linux-aio.  With shared memory, you can avoid ever bringing the data 
into memory via O_DIRECT and linux-aio.
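
For reference, a stripped-down sketch of that O_DIRECT + linux-aio path
(libaio, link with -laio; the file name and sizes are arbitrary and purely
illustrative):

#define _GNU_SOURCE          /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;
    int fd = open("/tmp/test.img", O_RDONLY | O_DIRECT);

    if (fd < 0 || io_setup(1, &ctx) < 0)
        return 1;
    /* O_DIRECT wants sector/page aligned buffers and lengths; in the shared
     * memory scheme 'buf' would point straight at guest or bounce pages. */
    if (posix_memalign(&buf, 4096, 4096))
        return 1;

    io_prep_pread(&cb, fd, buf, 4096, 0);    /* 4K read at offset 0 */
    if (io_submit(ctx, 1, cbs) != 1)
        return 1;
    if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
        printf("read completed, res=%ld\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}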


Regards,

Anthony Liguori



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-24 Thread Anthony Liguori

On 05/24/2010 02:07 PM, Christian Brunner wrote:

2010/5/24 MORITA Kazutakamorita.kazut...@lab.ntt.co.jp:

   

However, I don't think nbd would be a good protocol.  My preference
would be for a plugin API, or for a new local protocol that uses
splice() to avoid copies.

   

Both would be okay for Sheepdog.  I want to take a suitable approach
for qemu.
 

I think both should be possible:

- Using splice() we would need a daemon that is listening on a control
   socket for requests from qemu-processes or admin commands. When a
   qemu-process wants to open an image, it could call
   open_image(protocol:imagename) on the control socket and the daemon
   has to create a pipe to which the image is mapped.
   (What I'm unsure about are the security implications. Do we need some
   kind of authentication for the sockets? What about sVirt?)
   


This is a fairly old patch that I dug out of a backup.  It uses the 9p 
protocol and does proper support for AIO.


At one point in time, I actually implemented splice() support but it 
didn't result in a significant improvement in benchmarks.



- Building a plugin API seems a bit simpler to me, although I'm not sure
   if I get the idea correctly:
   The block layer already has some kind of API (.bdrv_file_open,
   .bdrv_read). We could simply compile the block drivers as shared
   objects and create a method for loading the necessary modules at
   runtime.
   


That approach would be a recipe for disaster.   We would have to 
introduce a new, reduced functionality block API that was supported for 
plugins.  Otherwise, the only way a plugin could keep up with our API 
changes would be if it was in tree which defeats the purpose of having 
plugins.


Regards,

Anthony Liguori


Are you planning to use this for all block drivers?

Regards,
Christian
   


diff --git a/Makefile b/Makefile
index 4f7a55a..541b26a 100644
--- a/Makefile
+++ b/Makefile
@@ -53,7 +53,7 @@ BLOCK_OBJS=cutils.o qemu-malloc.o
 BLOCK_OBJS+=block-cow.o block-qcow.o aes.o block-vmdk.o block-cloop.o
 BLOCK_OBJS+=block-dmg.o block-bochs.o block-vpc.o block-vvfat.o
 BLOCK_OBJS+=block-qcow2.o block-parallels.o block-nbd.o
-BLOCK_OBJS+=nbd.o block.o aio.o
+BLOCK_OBJS+=nbd.o block.o aio.o block-9p.o p9.o p9c.o
 
 ifdef CONFIG_WIN32
 BLOCK_OBJS += block-raw-win32.o
diff --git a/block-9p.c b/block-9p.c
new file mode 100644
index 000..5570f37
--- /dev/null
+++ b/block-9p.c
@@ -0,0 +1,573 @@
+/*
+ * 9p based block driver for QEMU
+ *
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ *  Anthony Liguori   aligu...@us.ibm.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "block_int.h"
+#include "p9c.h"
+#include "qemu_socket.h"
+
+#include <string.h>
+#include <stdlib.h>
+#include <errno.h>
+
+//#define DEBUG_BLOCK_9P
+
+#ifdef DEBUG_BLOCK_9P
+#define dprintf(fmt, ...) \
+    do { printf("block-9p: " fmt, ## __VA_ARGS__); } while (0)
+#define _dprintf(fmt, ...) \
+    do { printf(fmt, ## __VA_ARGS__); } while (0)
+#else
+#define dprintf(fmt, ...) \
+    do { } while (0)
+#define _dprintf(fmt, ...) \
+    do { } while (0)
+#endif
+
+typedef struct BDRV9pState {
+    P9IOState iops;
+    BlockDriverState *bs;
+    P9ClientState *client_state;
+    int fd;
+    char filename[1024];
+    int nwnames;
+    const char *wnames[256];
+    int do_loop;
+    int64_t length;
+    int32_t msize;
+    int count;
+} BDRV9pState;
+
+typedef struct P9AIOCB {
+    BlockDriverAIOCB common;
+    BDRV9pState *s;
+    int64_t offset;
+    size_t size;
+    void *buf;
+} P9AIOCB;
+
+static void p9_recv_notify(void *opaque)
+{
+    BDRV9pState *s = opaque;
+    p9c_notify_can_recv(s->client_state);
+}
+
+static void p9_send_notify(void *opaque)
+{
+    BDRV9pState *s = opaque;
+    p9c_notify_can_send(s->client_state);
+}
+
+static BDRV9pState *to_bs(P9IOState *iops)
+{
+    return container_of(iops, BDRV9pState, iops);
+}
+
+static ssize_t p9_send(P9IOState *iops, const void *data, size_t size)
+{
+    BDRV9pState *s = to_bs(iops);
+    ssize_t len;
+    len = send(s->fd, data, size, 0);
+    if (len == -1)
+        errno = socket_error();
+    return len;
+}
+
+static ssize_t p9_recv(P9IOState *iops, void *data, size_t size)
+{
+    BDRV9pState *s = to_bs(iops);
+    ssize_t len;
+    len = recv(s->fd, data, size, 0);
+    if (len == -1)
+        errno = socket_error();
+    return len;
+}
+
+static int p9_flush(void *opaque)
+{
+    BDRV9pState *s = opaque;
+    return !!s->count || s->do_loop;
+}
+
+static void p9_set_send_notify(P9IOState *iops, int enable)
+{
+    BDRV9pState *s = to_bs(iops);
+
+    if (enable)
+        qemu_aio_set_fd_handler(s->fd, p9_recv_notify, p9_send_notify,
+                                p9_flush, s);
+    else
+        qemu_aio_set_fd_handler(s->fd, p9_recv_notify, NULL, p9_flush, s);
+}
+
+static int p9_open_cb(void *opaque, int ret, const P9QID *qid, int32_t iounit)
+{
+    BDRV9pState *s = opaque;
+
+    if (ret) {
+  

Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-23 Thread Blue Swirl
On Thu, May 20, 2010 at 11:02 PM, Yehuda Sadeh Weinraub
yehud...@gmail.com wrote:
 On Thu, May 20, 2010 at 1:31 PM, Blue Swirl blauwir...@gmail.com wrote:
 On Wed, May 19, 2010 at 7:22 PM, Christian Brunner c...@muc.de wrote:
 The attached patch is a block driver for the distributed file system
 Ceph (http://ceph.newdream.net/). This driver uses librados (which
 is part of the Ceph server) for direct access to the Ceph object
 store and is running entirely in userspace. Therefore it is
 called rbd - rados block device.
 ...

 IIRC underscores here may conflict with system header use. Please use
 something like QEMU_BLOCK_RADOS_H.

 This header is shared between the linux kernel client and the ceph
 userspace servers and client. We can actually get rid of it, as we
 only need it to define CEPH_OSD_TMAP_SET. We can move this definition
 to librados.h.

 diff --git a/block/rbd_types.h b/block/rbd_types.h
 new file mode 100644
 index 000..dfd5aa0
 --- /dev/null
 +++ b/block/rbd_types.h
 @@ -0,0 +1,48 @@
 +#ifndef _FS_CEPH_RBD
 +#define _FS_CEPH_RBD

 QEMU_BLOCK_RBD?

 This header is shared between the ceph kernel client, between the qemu
 rbd module (and between other ceph utilities). It'd be much easier
 maintaining it without having to have a different implementation for
 each. The same goes to the use of __le32/64 and __u32/64 within these
 headers.

This is user space, so identifiers must conform to C standards. The
identifiers beginning with underscores are reserved.

Doesn't __le32/64 also depend on some GCC extension? Or sparse magic?



 +
 +#include <linux/types.h>

 Can you use standard includes, like sys/types.h or inttypes.h? Are
 Ceph libraries used in other systems than Linux?

 Not at the moment. I guess that we can take this include out.


 +
 +/*
 + * rbd image 'foo' consists of objects
 + *   foo.rbd      - image metadata
 + *   foo.
 + *   foo.0001
 + *   ...          - data
 + */
 +
 +#define RBD_SUFFIX             ".rbd"
 +#define RBD_DIRECTORY           "rbd_directory"
 +
 +#define RBD_DEFAULT_OBJ_ORDER  22   /* 4MB */
 +
 +#define RBD_MAX_OBJ_NAME_SIZE  96
 +#define RBD_MAX_SEG_NAME_SIZE  128
 +
 +#define RBD_COMP_NONE          0
 +#define RBD_CRYPT_NONE         0
 +
 +static const char rbd_text[] = "<<< Rados Block Device Image >>>\n";
 +static const char rbd_signature[] = "RBD";
 +static const char rbd_version[] = "001.001";
 +
 +struct rbd_obj_snap_ondisk {
 +       __le64 id;
 +       __le64 image_size;
 +} __attribute__((packed));
 +
 +struct rbd_obj_header_ondisk {
 +       char text[64];
 +       char signature[4];
 +       char version[8];
 +       __le64 image_size;

 Unaligned? Is the disk format fixed?

 This is a packed structure that represents the on disk format.
 Operations on it are being done only to read from the disk header or
 to write to the disk header.

That's clear. But what exactly is the alignment of field 'image_size'?
Could there be implicit padding to mod 8 between 'version' and
'image_size' with some compilers?

If there were no other constraints, I'd either make the padding
explicit, or rearrange/resize fields so that the field alignment is
natural. Thus my question, can you change the disk format or are there
already some deployments?

Otherwise, I'd just add some warning comment so people don't try to
use clever pointer tricks which will crash on machines with enforced
alignment.


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-23 Thread Yehuda Sadeh Weinraub
On Sun, May 23, 2010 at 12:59 AM, Blue Swirl blauwir...@gmail.com wrote:
 On Thu, May 20, 2010 at 11:02 PM, Yehuda Sadeh Weinraub
 yehud...@gmail.com wrote:
 On Thu, May 20, 2010 at 1:31 PM, Blue Swirl blauwir...@gmail.com wrote:
 On Wed, May 19, 2010 at 7:22 PM, Christian Brunner c...@muc.de wrote:
 The attached patch is a block driver for the distributed file system
 Ceph (http://ceph.newdream.net/). This driver uses librados (which
 is part of the Ceph server) for direct access to the Ceph object
 store and is running entirely in userspace. Therefore it is
 called rbd - rados block device.
 ...

 IIRC underscores here may conflict with system header use. Please use
 something like QEMU_BLOCK_RADOS_H.

 This header is shared between the linux kernel client and the ceph
 userspace servers and client. We can actually get rid of it, as we
 only need it to define CEPH_OSD_TMAP_SET. We can move this definition
 to librados.h.

 diff --git a/block/rbd_types.h b/block/rbd_types.h
 new file mode 100644
 index 000..dfd5aa0
 --- /dev/null
 +++ b/block/rbd_types.h
 @@ -0,0 +1,48 @@
 +#ifndef _FS_CEPH_RBD
 +#define _FS_CEPH_RBD

 QEMU_BLOCK_RBD?

 This header is shared between the ceph kernel client, between the qemu
 rbd module (and between other ceph utilities). It'd be much easier
 maintaining it without having to have a different implementation for
 each. The same goes to the use of __le32/64 and __u32/64 within these
 headers.

 This is user space, so identifiers must conform to C standards. The
 identifiers beginning with underscores are reserved.

 Doesn't __le32/64 also depend on some GCC extension? Or sparse magic?
It depends on gcc extension. If needed we can probably have a separate
header for the qemu block device that uses alternative types. Though
looking at the qemu code I see use of other gcc extensions so I'm not
sure this is a real issue.




 +
 +#include <linux/types.h>

 Can you use standard includes, like sys/types.h or inttypes.h? Are
 Ceph libraries used in other systems than Linux?

 Not at the moment. I guess that we can take this include out.


 +
 +/*
 + * rbd image 'foo' consists of objects
 + *   foo.rbd      - image metadata
 + *   foo.
 + *   foo.0001
 + *   ...          - data
 + */
 +
 +#define RBD_SUFFIX             ".rbd"
 +#define RBD_DIRECTORY           "rbd_directory"
 +
 +#define RBD_DEFAULT_OBJ_ORDER  22   /* 4MB */
 +
 +#define RBD_MAX_OBJ_NAME_SIZE  96
 +#define RBD_MAX_SEG_NAME_SIZE  128
 +
 +#define RBD_COMP_NONE          0
 +#define RBD_CRYPT_NONE         0
 +
 +static const char rbd_text[] = "<<< Rados Block Device Image >>>\n";
 +static const char rbd_signature[] = "RBD";
 +static const char rbd_version[] = "001.001";
 +
 +struct rbd_obj_snap_ondisk {
 +       __le64 id;
 +       __le64 image_size;
 +} __attribute__((packed));
 +
 +struct rbd_obj_header_ondisk {
 +       char text[64];
 +       char signature[4];
 +       char version[8];
 +       __le64 image_size;

 Unaligned? Is the disk format fixed?

 This is a packed structure that represents the on disk format.
 Operations on it are being done only to read from the disk header or
 to write to the disk header.

 That's clear. But what exactly is the alignment of field 'image_size'?
 Could there be implicit padding to mod 8 between 'version' and
 'image_size' with some compilers?

Obviously it's not 64 bit aligned. As it's an on-disk header, I don't
see alignment as a real issue. As was said before, any operation on these
fields has to go through endianity conversion anyway, and this
structure should not be used directly. For such data structures I'd
rather have the fields ordered in some logical order than maintaining
the alignment by ourselves. That's why we have that __attribute__
packed in the end to let the compiler deal with those issues. Other
compilers though have their own syntax for packed structures (but I do
see other uses of this packed syntax in the qemu code).


 If there were no other constraints, I'd either make the padding
 explicit, or rearrange/resize fields so that the field alignment is
 natural. Thus my question, can you change the disk format or are there
 already some deployments?

We can certainly make changes to the disk format at this point. I'm
not very happy with those 3 __u8 in the middle, and they can probably
be changed to a 32 bit flags field. We can get it 64 bit aligned too.


 Otherwise, I'd just add some warning comment so people don't try to
 use clever pointer tricks which will crash on machines with enforced
 alignment.

Any clever pointer tricks that'll work on one architecture will
probably be wrong on another (different word
size/alignment/endianity), so maybe crashing machines is a good
indicator of a bad implementation. We shouldn't try to hide the
problems.

Thanks,
Yehuda


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-21 Thread MORITA Kazutaka
At Fri, 21 May 2010 06:28:42 +0100,
Stefan Hajnoczi wrote:
 
 On Thu, May 20, 2010 at 11:16 PM, Christian Brunner c...@muc.de wrote:
  2010/5/20 Anthony Liguori anth...@codemonkey.ws:
  Both sheepdog and ceph ultimately transmit I/O over a socket to a central
  daemon, right?  So could we not standardize a protocol for this that both
  sheepdog and ceph could implement?
 
  There is no central daemon. The concept is that they talk to many
  storage nodes at the same time. Data is distributed and replicated
  over many nodes in the network. The mechanism to do this is quite
  complex. I don't know about sheepdog, but in Ceph this is called RADOS
  (reliable autonomic distributed object store). Sheepdog and Ceph may
  look similar, but this is where they act different. I don't think that
  it would be possible to implement a common protocol.
 
 I believe Sheepdog has a local daemon on each node.  The QEMU storage
 backend talks to the daemon on the same node, which then does the real
 network communication with the rest of the distributed storage system.

Yes.  It is because Sheepdog doesn't have a configuration about
cluster membership, as I mentioned in another mail, so the driver
doesn't know which node to access other than localhost.

  So I think we're not talking about a network protocol here, we're
 talking about a common interface that can be used by QEMU and other
 programs to take advantage of Ceph, Sheepdog, etc services available
 on the local node.
 
 Haven't looked into your patch enough yet, but does librados talk
 directly over the network or does it connect to a local daemon/driver?
 

AFAIK, librados talks directly over the network, so I think it is
difficult to define a common interface.


Thanks,

Kazutaka



Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Blue Swirl
On Wed, May 19, 2010 at 7:22 PM, Christian Brunner c...@muc.de wrote:
 The attached patch is a block driver for the distributed file system
 Ceph (http://ceph.newdream.net/). This driver uses librados (which
 is part of the Ceph server) for direct access to the Ceph object
 store and is running entirely in userspace. Therefore it is
 called rbd - rados block device.

 To compile the driver a recent version of ceph (= 0.20.1) is needed
 and you have to --enable-rbd when running configure.

 Additional information is available on the Ceph-Wiki:

 http://ceph.newdream.net/wiki/Kvm-rbd


I have no idea whether it makes sense to add Ceph (no objection
either). I have some minor comments below.


 ---
  Makefile          |    3 +
  Makefile.objs     |    1 +
  block/rados.h     |  376 ++
  block/rbd.c       |  585 
 +
  block/rbd_types.h |   48 +
  configure         |   27 +++
  6 files changed, 1040 insertions(+), 0 deletions(-)
  create mode 100644 block/rados.h
  create mode 100644 block/rbd.c
  create mode 100644 block/rbd_types.h

 diff --git a/Makefile b/Makefile
 index eb9e02b..b1ab3e9 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -27,6 +27,9 @@ configure: ;
  $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)

  LIBS+=-lz $(LIBS_TOOLS)
 +ifdef CONFIG_RBD
 +LIBS+=-lrados
 +endif

  ifdef BUILD_DOCS
  DOCS=qemu-doc.html qemu-tech.html qemu.1 qemu-img.1 qemu-nbd.8
 diff --git a/Makefile.objs b/Makefile.objs
 index acbaf22..85791ac 100644
 --- a/Makefile.objs
 +++ b/Makefile.objs
 @@ -18,6 +18,7 @@ block-nested-y += parallels.o nbd.o blkdebug.o
  block-nested-$(CONFIG_WIN32) += raw-win32.o
  block-nested-$(CONFIG_POSIX) += raw-posix.o
  block-nested-$(CONFIG_CURL) += curl.o
 +block-nested-$(CONFIG_RBD) += rbd.o

  block-obj-y +=  $(addprefix block/, $(block-nested-y))

 diff --git a/block/rados.h b/block/rados.h
 new file mode 100644
 index 000..6cde9a1
 --- /dev/null
 +++ b/block/rados.h
 @@ -0,0 +1,376 @@
 +#ifndef __RADOS_H
 +#define __RADOS_H

IIRC underscores here may conflict with system header use. Please use
something like QEMU_BLOCK_RADOS_H.

 +
 +/*
 + * Data types for the Ceph distributed object storage layer RADOS
 + * (Reliable Autonomic Distributed Object Store).
 + */
 +
 +
 +
 +/*
 + * osdmap encoding versions
 + */
 +#define CEPH_OSDMAP_INC_VERSION     5
 +#define CEPH_OSDMAP_INC_VERSION_EXT 5
 +#define CEPH_OSDMAP_VERSION         5
 +#define CEPH_OSDMAP_VERSION_EXT     5
 +
 +/*
 + * fs id
 + */
 +struct ceph_fsid {
 +       unsigned char fsid[16];

Too large indent, please check also elsewhere.

 +};
 +
 +static inline int ceph_fsid_compare(const struct ceph_fsid *a,
 +                                   const struct ceph_fsid *b)
 +{
 +       return memcmp(a, b, sizeof(*a));
 +}
 +
 +/*
 + * ino, object, etc.
 + */
 +typedef __le64 ceph_snapid_t;

Please use uint64_t and le_to_cpu()/cpu_to_le().

 +#define CEPH_SNAPDIR ((__u64)(-1))  /* reserved for hidden .snap dir */

Likewise, uint64_t is the standard type. Also other places.

 +#define CEPH_NOSNAP  ((__u64)(-2))  /* head, live revision */
 +#define CEPH_MAXSNAP ((__u64)(-3))  /* largest valid snapid */
 +
 +struct ceph_timespec {
 +       __le32 tv_sec;
 +       __le32 tv_nsec;
 +} __attribute__ ((packed));
 +
 +
 +/*
 + * object layout - how objects are mapped into PGs
 + */
 +#define CEPH_OBJECT_LAYOUT_HASH     1
 +#define CEPH_OBJECT_LAYOUT_LINEAR   2
 +#define CEPH_OBJECT_LAYOUT_HASHINO  3
 +
 +/*
 + * pg layout -- how PGs are mapped onto (sets of) OSDs
 + */
 +#define CEPH_PG_LAYOUT_CRUSH  0
 +#define CEPH_PG_LAYOUT_HASH   1
 +#define CEPH_PG_LAYOUT_LINEAR 2
 +#define CEPH_PG_LAYOUT_HYBRID 3
 +
 +
 +/*
 + * placement group.
 + * we encode this into one __le64.
 + */
 +struct ceph_pg {
 +       __le16 preferred; /* preferred primary osd */
 +       __le16 ps;        /* placement seed */
 +       __le32 pool;      /* object pool */
 +} __attribute__ ((packed));
 +
 +/*
 + * pg_pool is a set of pgs storing a pool of objects
 + *
 + *  pg_num -- base number of pseudorandomly placed pgs
 + *
 + *  pgp_num -- effective number when calculating pg placement.  this
 + * is used for pg_num increases.  new pgs result in data being split
 + * into new pgs.  for this to proceed smoothly, new pgs are initially
 + * colocated with their parents; that is, pgp_num doesn't increase
 + * until the new pgs have successfully split.  only _then_ are the new
 + * pgs placed independently.
 + *
 + *  lpg_num -- localized pg count (per device).  replicas are randomly
 + * selected.
 + *
 + *  lpgp_num -- as above.
 + */
 +#define CEPH_PG_TYPE_REP     1
 +#define CEPH_PG_TYPE_RAID4   2
 +#define CEPH_PG_POOL_VERSION 2
 +struct ceph_pg_pool {
 +       __u8 type;                /* CEPH_PG_TYPE_* */
 +       __u8 size;                /* number of osds in each pg */
 +       __u8 crush_ruleset;       /* crush placement rule */
 +       __u8 object_hash;         /* hash mapping 

Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Christian Brunner
2010/5/20 Blue Swirl blauwir...@gmail.com:
 On Wed, May 19, 2010 at 7:22 PM, Christian Brunner c...@muc.de wrote:
 The attached patch is a block driver for the distributed file system
 Ceph (http://ceph.newdream.net/). This driver uses librados (which
 is part of the Ceph server) for direct access to the Ceph object
 store and is running entirely in userspace. Therefore it is
 called rbd - rados block device.

 To compile the driver a recent version of ceph (= 0.20.1) is needed
 and you have to --enable-rbd when running configure.

 Additional information is available on the Ceph-Wiki:

 http://ceph.newdream.net/wiki/Kvm-rbd


 I have no idea whether it makes sense to add Ceph (no objection
 either). I have some minor comments below.

Thanks for your comments. I'll send an updated patch in a few days.

Having a central storage system is quite essential in larger hosting
environments, it enables you to move your guest systems from one node
to another easily (live-migration or dynamic restart). Traditionally
this has been done using SAN, iSCSI or NFS. However most of these
systems don't scale very well and the costs for high-availability
are quite high.

With new approaches like Sheepdog or Ceph, things are getting a lot
cheaper and you can scale your system without disrupting your service.
The concepts are quite similar to what Amazon is doing in their EC2
environment, but they certainly won't publish it as OpenSource anytime
soon.

Both projects have advantages and disadvantages. Ceph is a bit more
universal as it implements a whole filesystem. Sheepdog is more
feature complete in regards of managing images (e.g. snapshots). Both
projects require some additional work to become stable, but they are
on a good way.

I would really like to see both drivers in the qemu tree, as they are
the key to a design shift in how storage in the datacenter is being
built.

Christian


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Anthony Liguori

On 05/20/2010 04:18 PM, Christian Brunner wrote:

Thanks for your comments. I'll send an updated patch in a few days.

Having a central storage system is quite essential in larger hosting
environments, it enables you to move your guest systems from one node
to another easily (live-migration or dynamic restart). Traditionally
this has been done using SAN, iSCSI or NFS. However most of these
systems don't scale very well and the costs for high-availability
are quite high.

With new approaches like Sheepdog or Ceph, things are getting a lot
cheaper and you can scale your system without disrupting your service.
The concepts are quite similar to what Amazon is doing in their EC2
environment, but they certainly won't publish it as OpenSource anytime
soon.

Both projects have advantages and disadvantages. Ceph is a bit more
universal as it implements a whole filesystem. Sheepdog is more
feature complete in regards of managing images (e.g. snapshots). Both
projects require some additional work to become stable, but they are
on a good way.

I would really like to see both drivers in the qemu tree, as they are
the key to a design shift in how storage in the datacenter is being
built.
   


I'd be more interested in enabling people to build these types of 
storage systems without touching qemu.


Both sheepdog and ceph ultimately transmit I/O over a socket to a 
central daemon, right?  So could we not standardize a protocol for this 
that both sheepdog and ceph could implement?


Regards,

Anthony Liguori


Christian


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Christian Brunner
2010/5/20 Anthony Liguori anth...@codemonkey.ws:
 With new approaches like Sheepdog or Ceph, things are getting a lot
 cheaper and you can scale your system without disrupting your service.
 The concepts are quite similar to what Amazon is doing in their EC2
 environment, but they certainly won't publish it as OpenSource anytime
 soon.

 Both projects have advantages and disadvantages. Ceph is a bit more
 universal as it implements a whole filesystem. Sheepdog is more
 feature complete in regards of managing images (e.g. snapshots). Both
 projects require some additional work to become stable, but they are
 on a good way.

 I would really like to see both drivers in the qemu tree, as they are
 the key to a design shift in how storage in the datacenter is being
 built.


 I'd be more interested in enabling people to build these types of storage
 systems without touching qemu.

You could do this by using Yehuda's rbd kernel driver, but I think
that it would be better to avoid this additional layer.

 Both sheepdog and ceph ultimately transmit I/O over a socket to a central
 daemon, right?  So could we not standardize a protocol for this that both
 sheepdog and ceph could implement?

There is no central daemon. The concept is that they talk to many
storage nodes at the same time. Data is distributed and replicated
over many nodes in the network. The mechanism to do this is quite
complex. I don't know about sheepdog, but in Ceph this is called RADOS
(reliable autonomic distributed object store). Sheepdog and Ceph may
look similar, but this is where they act different. I don't think that
it would be possible to implement a common protocol.

Regards,
Christian


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Yehuda Sadeh Weinraub
On Thu, May 20, 2010 at 1:31 PM, Blue Swirl blauwir...@gmail.com wrote:
 On Wed, May 19, 2010 at 7:22 PM, Christian Brunner c...@muc.de wrote:
 The attached patch is a block driver for the distributed file system
 Ceph (http://ceph.newdream.net/). This driver uses librados (which
 is part of the Ceph server) for direct access to the Ceph object
 store and is running entirely in userspace. Therefore it is
 called rbd - rados block device.
...

 IIRC underscores here may conflict with system header use. Please use
 something like QEMU_BLOCK_RADOS_H.

This header is shared between the linux kernel client and the ceph
userspace servers and client. We can actually get rid of it, as we
only need it to define CEPH_OSD_TMAP_SET. We can move this definition
to librados.h.

 diff --git a/block/rbd_types.h b/block/rbd_types.h
 new file mode 100644
 index 000..dfd5aa0
 --- /dev/null
 +++ b/block/rbd_types.h
 @@ -0,0 +1,48 @@
 +#ifndef _FS_CEPH_RBD
 +#define _FS_CEPH_RBD

 QEMU_BLOCK_RBD?

This header is shared between the ceph kernel client, between the qemu
rbd module (and between other ceph utilities). It'd be much easier
maintaining it without having to have a different implementation for
each. The same goes to the use of __le32/64 and __u32/64 within these
headers.


 +
 +#include <linux/types.h>

 Can you use standard includes, like sys/types.h or inttypes.h? Are
 Ceph libraries used in other systems than Linux?

Not at the moment. I guess that we can take this include out.


 +
 +/*
 + * rbd image 'foo' consists of objects
 + *   foo.rbd      - image metadata
 + *   foo.
 + *   foo.0001
 + *   ...          - data
 + */
 +
 +#define RBD_SUFFIX             ".rbd"
 +#define RBD_DIRECTORY           "rbd_directory"
 +
 +#define RBD_DEFAULT_OBJ_ORDER  22   /* 4MB */
 +
 +#define RBD_MAX_OBJ_NAME_SIZE  96
 +#define RBD_MAX_SEG_NAME_SIZE  128
 +
 +#define RBD_COMP_NONE          0
 +#define RBD_CRYPT_NONE         0
 +
 +static const char rbd_text[] = "<<< Rados Block Device Image >>>\n";
 +static const char rbd_signature[] = "RBD";
 +static const char rbd_version[] = "001.001";
 +
 +struct rbd_obj_snap_ondisk {
 +       __le64 id;
 +       __le64 image_size;
 +} __attribute__((packed));
 +
 +struct rbd_obj_header_ondisk {
 +       char text[64];
 +       char signature[4];
 +       char version[8];
 +       __le64 image_size;

 Unaligned? Is the disk format fixed?

This is a packed structure that represents the on disk format.
Operations on it are being done only to read from the disk header or
to write to the disk header.


Yehuda


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Stefan Hajnoczi
On Thu, May 20, 2010 at 11:16 PM, Christian Brunner c...@muc.de wrote:
 2010/5/20 Anthony Liguori anth...@codemonkey.ws:
 Both sheepdog and ceph ultimately transmit I/O over a socket to a central
 daemon, right?  So could we not standardize a protocol for this that both
 sheepdog and ceph could implement?

 There is no central daemon. The concept is that they talk to many
 storage nodes at the same time. Data is distributed and replicated
 over many nodes in the network. The mechanism to do this is quite
 complex. I don't know about sheepdog, but in Ceph this is called RADOS
 (reliable autonomic distributed object store). Sheepdog and Ceph may
 look similar, but this is where they act different. I don't think that
 it would be possible to implement a common protocol.

I believe Sheepdog has a local daemon on each node.  The QEMU storage
backend talks to the daemon on the same node, which then does the real
network communication with the rest of the distributed storage system.
 So I think we're not talking about a network protocol here, we're
talking about a common interface that can be used by QEMU and other
programs to take advantage of Ceph, Sheepdog, etc services available
on the local node.

Haven't looked into your patch enough yet, but does librados talk
directly over the network or does it connect to a local daemon/driver?

Stefan


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread MORITA Kazutaka
At Fri, 21 May 2010 00:16:46 +0200,
Christian Brunner wrote:
 
 2010/5/20 Anthony Liguori anth...@codemonkey.ws:
  With new approaches like Sheepdog or Ceph, things are getting a lot
  cheaper and you can scale your system without disrupting your service.
  The concepts are quite similar to what Amazon is doing in their EC2
  environment, but they certainly won't publish it as OpenSource anytime
  soon.
 
  Both projects have advantages and disadvantages. Ceph is a bit more
  universal as it implements a whole filesystem. Sheepdog is more
  feature complete in regards of managing images (e.g. snapshots). Both

I think a major difference is that Sheepdog servers act fully
autonomously.  Any Sheepdog server has no fixed role such as a monitor
server, and Sheepdog doesn't require any configuration about a list of
nodes in the cluster.


  projects require some additional work to become stable, but they are
  on a good way.
 
  I would really like to see both drivers in the qemu tree, as they are
  the key to a design shift in how storage in the datacenter is being
  built.
 
 
  I'd be more interested in enabling people to build these types of storage
  systems without touching qemu.
 
 You could do this by using Yehuda's rbd kernel driver, but I think
 that it would be better to avoid this additional layer.
 

I agree.  In addition, if a storage client is a qemu driver, the
storage system can support some features specific to qemu such as live
snapshot from qemu monitor.

Regards,

Kazutaka
