Re: [Qemu-devel] [PATCH V2 2/2] block/nfs: add support for nfs_umount

2019-09-13 Thread ronnie sahlberg
On Wed, Sep 11, 2019 at 5:48 PM Max Reitz  wrote:
>
> On 10.09.19 17:41, Peter Lieven wrote:
> > libnfs recently added support for unmounting. Add support
> > in Qemu too.
> >
> > Signed-off-by: Peter Lieven 
> > ---
> >  block/nfs.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/block/nfs.c b/block/nfs.c
> > index 2c98508275..f39acfdb28 100644
> > --- a/block/nfs.c
> > +++ b/block/nfs.c
> > @@ -398,6 +398,9 @@ static void nfs_client_close(NFSClient *client)
> >  nfs_close(client->context, client->fh);
> >  client->fh = NULL;
> >  }
> > +#ifdef LIBNFS_FEATURE_UMOUNT
> > +nfs_umount(client->context);
> > +#endif
> >  nfs_destroy_context(client->context);
> >  client->context = NULL;
> >  }
>
> I don’t understand what unmounting means in this context.  Is it just
> generic clean-up for NFSv3 (it appears that it’s a no-op for NFSv4)?
> Why isn’t that done by nfs_destroy_context()?

Umount is weird, since there isn't actually any state in NFSv3:
"mounting" in NFSv3 is really just a matter of converting the path to
be mounted into a filehandle.
That is all the mount protocol is really used for.

This is all handled by a separate protocol/server called rpc.mountd,
which is separate from nfsd, running as a different process and
listening on a different port.
The only purpose of rpc.mountd is to take a path to a share and
return an NFSv3 filehandle to the root of that path.
As a side effect, rpc.mountd also keeps track of which clients have
called MNT but not yet UMNT, so "showmount -a" can give a list of all
clients that have "mounted" the share but not yet called UMNT.

UMNT has no effect at all on NFSv3 I/O and is purely cosmetic; it ONLY
affects "showmount -a" output.
NFSv4 does away with all these separate protocols such as mount,
statd, nlm and even portmapper,
so there is not even a concept of "showmount -a" for NFSv4.
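
For illustration only (hypothetical server and client names; output
shape as printed by showmount(8)):

    $ showmount -a nfs-server
    All mount points on nfs-server:
    client1:/export/images

A client that disconnected without sending UMNT simply lingers in that
list.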


As the libnfs maintainer, why did I do nfs_umount() the way I did?
First of all, I think of the NFS UMNT call as purely cosmetic and didn't
want to put too much work into it. But people wanted it.

I implemented it as a sync function since I think few people would
actually use it at all, and it meant that I didn't have to invest
in building an async pipeline for it.

I did NOT implement it inside nfs_destroy_context(), since that
function is supposed to be, in my view, non-blocking and should just
tear down the connection and immediately return.
An unmount would be:
* close the tcp socket to the nfs server
* open a new socket to portmapper and ask "where is rpc.mountd"
* close the socket to portmapper, then open a new socket to rpc.mountd
* tell rpc.mountd to remove us from the showmount -a list
* close the socket

I just took the cheap and easy path: do it as a sync function with my
own event loop.

Again, UMNT has no real effect on anything related to NFS except what
showmount -a will return. That is one big reason why
I was just not motivated enough to build it as an async function.

Once we all switch to NFSv4 this will all be moot since the MOUNT
protocol no longer exists and neither does rpc.mountd.



>
> Max
>



Re: [Qemu-devel] [PATCH] block/nfs: add support for nfs_umount

2019-09-05 Thread ronnie sahlberg
On Thu, Sep 5, 2019 at 8:16 PM Peter Lieven  wrote:
>
> On 05.09.19 at 12:05, ronnie sahlberg wrote:
> > On Thu, Sep 5, 2019 at 7:43 PM Peter Lieven  wrote:
> >> On 04.09.19 at 11:34, Kevin Wolf wrote:
> >>> On 03.09.2019 at 21:52, Peter Lieven wrote:
> >>>>> On 03.09.2019 at 16:56, Kevin Wolf wrote:
> >>>>>
> >>>>> On 03.09.2019 at 15:44, Peter Lieven wrote:
> >>>>>> libnfs recently added support for unmounting. Add support
> >>>>>> in Qemu too.
> >>>>>>
> >>>>>> Signed-off-by: Peter Lieven 
> >>>>> Looks trivial enough to review even for me. :-)
> >>>>>
> >>>>> Thanks, applied to the block branch.
> >>>>>
> >>>>> Kevin
> >>>> I am not sure what the reason is, but with this patch I sometimes run
> >>>> into nfs_process_read being called for a cdrom mounted from nfs after
> >>>> I ejected it (and the whole nfs client context is already destroyed).
> >>> Does this mean that nfs_umount() gets some response, but we don't
> >>> properly wait for it? Or is some older request still in flight?
> >>
> >> nfs_umount itself is a sync call and should only terminate when
> >>
> >> the call is done. But there is an independent I/O handler in that
> >>
> >> function polling on the fd. (wait_for_nfs_reply in libnfs-sync.c).
> >>
> >> This is why I thought the right solution is to stop the Qemu I/O handler
> >>
> >> before calling nfs_close and nfs_umount. nfs_close also uses this
> >>
> >> sync I/O handler, but for some reason it seems not to make trouble.
> >>
> >>
> >> The other solution would be to use the async versions of close and umount,
> >>
> >> but that would make the code in Qemu more complex.
> >>
> >>
> > NFS umount is pretty messy, so I think you should continue using the
> > sync version.
> > In NFSv3 (there is no mount protocol in v4) the Mount call (fetch the
> > root filehandle) and the Umount call (tell the server we should no
> > longer show up in "showmount -a" output) are not part of the NFS
> > protocol but a different service running on a separate port.
> >
> > This does not map well to libnfs since it is centered around a "struct
> > nfs_context".
> >
> > To use nfs_umount() from QEMU I would suggest:
> > 1, make sure all commands in flight have finished, because you will
> > soon disconnect from the NFS server and will never receive any
> > in-flight responses.
> > 2, unregister the nfs->fh filedescriptor from your eventsystem,
> > because the fd is about to be closed and there is a great chance it
> > will be recycled for a completely different purpose if you open any
> > other files from qemu.
> >
> > 3, call nfs_umount().  Internally this will close the socket to the
> > NFS server, then go through the process to open a new socket to the
> > portmapper to discover the mount server, then close that socket and
> > reconnect a new socket again to the mount server and perform the UMNT
> > call.
>
>
> What we currently do in Qemu is:
>
>
> 1) bdrv_drain
>
> 2) bdrv_close which in the end calls nfs_client_close from block/nfs.c.
>
>There we call:
>
>2a) nfs_close(client->fh)
>
>2b) aio_set_fd_handler(NULL)
>
>2c) nfs_destroy_context(client->context);
>
>
> My first patch added a nfs_umount between 2a) and 2b) so that we have
>
>2a) nfs_close(client->fh)
>
>2b) nfs_umount(client->context)
>
>2c) aio_set_fd_handler(NULL)
>
>2d) nfs_destroy_context(client->context);
>
>
> This triggers the assertion for an uninitialized client->mutex, from an
> invocation of nfs_process_read after nfs_destroy_context was called.
>
>
>
> If I change the order as follows, I see no more assertions:
>
>2a) aio_set_fd_handler(NULL)
>
>2b) nfs_close(client->fh)
>
>2c) nfs_umount(client->context)
>
>2d) nfs_destroy_context(client->context);

That makes sense and looks correct to me.
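
In block/nfs.c terms that ordering is roughly the sketch below (the
aio_set_fd_handler signature is the QEMU-4.x one; error handling is
elided):

    static void nfs_client_close(NFSClient *client)
    {
        if (client->context) {
            /* 2a: stop watching the fd before any sync libnfs call
             * runs its own internal event loop on it */
            aio_set_fd_handler(client->aio_context,
                               nfs_get_fd(client->context),
                               false, NULL, NULL, NULL, NULL);
            if (client->fh) {
                nfs_close(client->context, client->fh);    /* 2b */
            }
    #ifdef LIBNFS_FEATURE_UMOUNT
            nfs_umount(client->context);                   /* 2c */
    #endif
            nfs_destroy_context(client->context);          /* 2d */
            client->context = NULL;
        }
    }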

>
>
> I think we should have done this in the first place, because nfs_close
> (and nfs_umount) poll on the nfs_fd in parallel if we use the sync
> calls.
>
>
> Peter
>
>
>



Re: [Qemu-devel] [PATCH] block/nfs: add support for nfs_umount

2019-09-05 Thread ronnie sahlberg
On Thu, Sep 5, 2019 at 7:43 PM Peter Lieven  wrote:
>
> On 04.09.19 at 11:34, Kevin Wolf wrote:
> > On 03.09.2019 at 21:52, Peter Lieven wrote:
> >>
> >>> On 03.09.2019 at 16:56, Kevin Wolf wrote:
> >>>
> >>> On 03.09.2019 at 15:44, Peter Lieven wrote:
> >>>> libnfs recently added support for unmounting. Add support
> >>>> in Qemu too.
> >>>>
> >>>> Signed-off-by: Peter Lieven 
> >>> Looks trivial enough to review even for me. :-)
> >>>
> >>> Thanks, applied to the block branch.
> >>>
> >>> Kevin
> >> I am not sure what the reason is, but with this patch I sometimes run
> >> into nfs_process_read being called for a cdrom mounted from nfs after
> >> I ejected it (and the whole nfs client context is already destroyed).
> > Does this mean that nfs_umount() gets some response, but we don't
> > properly wait for it? Or is some older request still in flight?
>
>
> nfs_umount itself is a sync call and should only terminate when
>
> the call is done. But there is an independent I/O handler in that
>
> function polling on the fd. (wait_for_nfs_reply in libnfs-sync.c).
>
> This is why I thought the right solution is to stop the Qemu I/O handler
>
> before calling nfs_close and nfs_umount. nfs_close also uses this
>
> sync I/O handler, but for some reason it seems not to make trouble.
>
>
> The other solution would be to use the async versions of close and umount,
>
> but that would make the code in Qemu more complex.
>
>

NFS umount is pretty messy, so I think you should continue using the
sync version.
In NFSv3 (there is no mount protocol in v4) the Mount call (fetch the
root filehandle) and the Umount call (tell the server we should no
longer show up in "showmount -a" output) are not part of the NFS
protocol but a different service running on a separate port.

This does not map well to libnfs since it is centered around a "struct
nfs_context".

To use nfs_umount() from QEMU I would suggest:
1, make sure all commands in flight have finished, because you will
soon disconnect from the NFS server and will never receive any
in-flight responses.
2, unregister the nfs->fh filedescriptor from your eventsystem,
because the fd is about to be closed and there is a great chance it
will be recycled for a completely different purpose if you open any
other files from qemu.

3, call nfs_umount().  Internally this will close the socket to the
NFS server, then go through the process to open a new socket to the
portmapper to discover the mount server, then close that socket and
reconnect a new socket again to the mount server and perform the UMNT
call.

ronnie sahlberg


> Peter
>
>



Re: [Qemu-devel] iSER transport name is not good

2017-12-07 Thread ronnie sahlberg
David,

Yes, QEMU has supported iSER (iSCSI extensions for RDMA) via a
userspace library, libiscsi, for about a year.
Here is a nice presentation from the developer of the iSER support:
https://www.snia.org/sites/default/files/SDC/2016/presentations/storage_networking/Shterman-Grimberg_Greenberg_Performance%20Implications%20Libiscsi_%20RDMA_V6.pdf

In QEMU, you use it the same way as you would use the userspace iSCSI
support, which has been in QEMU for quite a long time.
The only difference is that instead of providing an iscsi://... URL you
select iSER by specifying a URL of the form iser://...


I.e. the only difference is the protocol part of the URL. When QEMU
passes this URL to libiscsi, it allows the library to decide whether
to use normal iSCSI or whether it should use iSER.
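
For example (made-up portal/target/LUN, mirroring the iscsi:// examples
quoted later in this digest):

    qemu-system-x86_64 -drive file=iser://192.0.2.1/iqn.2001-04.com.example/1

versus the plain TCP transport:

    qemu-system-x86_64 -drive file=iscsi://192.0.2.1/iqn.2001-04.com.example/1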

I think RDMA would be a less good name for this, as RDMA is not only
used to transport iSCSI but is also used for NFS as well as SMB.



regards
ronnie sahlberg


On Fri, Dec 8, 2017 at 4:41 AM, Dr. David Alan Gilbert
<dgilb...@redhat.com> wrote:
> * Charles Kelimod (lichs...@gmail.com) wrote:
>> Hello,
>>
>>
>> I'm on the road of modifying libvirt to add iSER support, and I found qemu
>> actually supports that. But the transport name is iSER, which should be
>> RDMA, as libvirt has already defined that transport name.
>>
>> Best Regards,
>> Charles.
>
> Hi Charles,
>   So this is iSCSI extensions for RDMA?  I've cc'd in the qemu iSCSI
> maintainers.
>   Can you just explain what you're seeing libvirt do and what you think
> qemu is expecting it to do for iSER?
>
> Dave
>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH 3/8] qemu-options qemu-doc: Move "Device URL Syntax" to qemu-doc

2017-10-04 Thread ronnie sahlberg
On Wed, Oct 4, 2017 at 8:12 PM, Marc-André Lureau
<marcandre.lur...@gmail.com> wrote:
> On Mon, Oct 2, 2017 at 4:03 PM, Markus Armbruster <arm...@redhat.com> wrote:
>> Commit 0f5314a (v1.0) added section "Device URL Syntax" to
>> qemu-options.hx.  It's enclosed in STEXI..ETEXI, thus affects only
>> qemu-options.texi, not --help.  It appears as a subsection under
>> section "Invocation".  Similarly, qemu.1 has it as a subsection under
>> "OPTIONS".
>>
>> Commit f9dadc9 (v1.1.0) dropped new option -iscsi into the middle of
>> this section.  No effect on qemu-options.texi.  It appears in --help
>> run together with the "Bluetooth(R) options:" header.
>>
>> Commit c70a01e (v1.5.0) gives it its own heading in --help by moving
>> commit 0f5314a's DEFHEADING(Device URL Syntax:) outside STEXI..ETEXI.
>> Trouble is the heading makes no sense for -iscsi.
>>
>> Move all of the "Device URL Syntax" Texinfo to qemu-doc.texi.  Mark it
>> for inclusion in qemu.1 with '@c man begin NOTES'.  This turns it into
>> a separate section outside the list of options both in qemu-doc and in
>> qemu.1.
>>
>> There's substantial overlap with the existing qemu-doc section "Disk
>> Images".  Mark with a TODO comment.
>>
>> Output of --help will be fixed next.
>>
>> Cc: Ronnie Sahlberg <ronniesahlb...@gmail.com>
>> Cc: Kevin Wolf <kw...@redhat.com>
>> Cc: Max Reitz <mre...@redhat.com>
>> Cc: qemu-bl...@nongnu.org
>> Signed-off-by: Markus Armbruster <arm...@redhat.com>
>
> Reviewed-by: Marc-André Lureau <marcandre.lur...@redhat.com>
>
>
>> ---
>>  qemu-doc.texi   | 217 ++++++++++++++++++++++++++++++++++++++++++++++
>>  qemu-options.hx | 222 ----------------------------------------------
>>  2 files changed, 217 insertions(+), 222 deletions(-)
>>
>> diff --git a/qemu-doc.texi b/qemu-doc.texi
>> index ecd186a159..848e49966a 100644
>> --- a/qemu-doc.texi
>> +++ b/qemu-doc.texi
>> @@ -245,6 +245,223 @@ targets do not need a disk image.
>>
>>  @c man end
>>
>> +@node device_url
>> +@subsection Device URL Syntax
>> +@c TODO merge this with section Disk Images
>> +
>> +@c man begin NOTES
>> +
>> +In addition to using normal file images for the emulated storage devices,
>> +QEMU can also use networked resources such as iSCSI devices. These are
>> +specified using a special URL syntax.
>> +
>> +@table @option
>> +@item iSCSI
>> +iSCSI support allows QEMU to access iSCSI resources directly and use as
>> +images for the guest storage. Both disk and cdrom images are supported.
>> +
>> +Syntax for specifying iSCSI LUNs is
>> +``iscsi://<target-ip>[:<port>]/<target-iqn>/<lun>''
>> +
>> +By default qemu will use the iSCSI initiator-name
>> +'iqn.2008-11.org.linux-kvm[:<name>]' but this can also be set from the
>> +command line or a configuration file.
>> +
>> +Since version Qemu 2.4 it is possible to specify an iSCSI request timeout
>> +to detect stalled requests and force a reestablishment of the session. The
>> +timeout is specified in seconds. The default is 0 which means no timeout.
>> +Libiscsi 1.15.0 or greater is required for this feature.
>> +
>> +Example (without authentication):
>> +@example
>> +qemu-system-i386 -iscsi initiator-name=iqn.2001-04.com.example:my-initiator \
>> + -cdrom iscsi://192.0.2.1/iqn.2001-04.com.example/2 \
>> + -drive file=iscsi://192.0.2.1/iqn.2001-04.com.example/1
>> +@end example
>> +
>> +Example (CHAP username/password via URL):
>> +@example
>> +qemu-system-i386 -drive file=iscsi://user%password@@192.0.2.1/iqn.2001-04.com.example/1
>> +@end example
>> +
>> +Example (CHAP username/password via environment variables):
>> +@example
>> +LIBISCSI_CHAP_USERNAME="user" \
>> +LIBISCSI_CHAP_PASSWORD="password" \
>> +qemu-system-i386 -drive file=iscsi://192.0.2.1/iqn.2001-04.com.example/1
>> +@end example
>> +
>> +@item NBD
>> +QEMU supports NBD (Network Block Devices) both using TCP protocol as well
>> +as Unix Domain Sockets.
>> +
>> +Syntax for specifying a NBD device using TCP
>> +``nbd:<server-ip>:<port>[:exportname=<export>]''
>> +
>> +Syntax for specifying a NBD device using Unix Domain Sockets
>> +``nbd:unix:<domain-socket>[:exportname=<export>]''
>> +
>> +Example for TCP
>> +@example
>> +qemu-system-i386 --drive file=nbd:192.0.2.1:30000
>> +@end example

Re: [Qemu-devel] [PATCH 4/8] qemu-options: Move -iscsi under "Block device options"

2017-10-04 Thread ronnie sahlberg
Reviewed-by: Ronnie Sahlberg <ronniesahlb...@gmail.com>

Would be nice if this died at some stage:
[,password=password]

But that is for a different patch.

On Wed, Oct 4, 2017 at 8:12 PM, Marc-André Lureau
<marcandre.lur...@gmail.com> wrote:
> On Mon, Oct 2, 2017 at 4:03 PM, Markus Armbruster <arm...@redhat.com> wrote:
>> -iscsi ended up under the "Device URL Syntax" heading by a sequence of
>> errors, as explained in the previous commit.  Move it under the "Block
>> device options" heading.  Nothing left under "Device URL Syntax";
>> drop the heading.
>>
>> Cc: Ronnie Sahlberg <ronniesahlb...@gmail.com>
>> Cc: Kevin Wolf <kw...@redhat.com>
>> Cc: Max Reitz <mre...@redhat.com>
>> Cc: qemu-bl...@nongnu.org
>> Signed-off-by: Markus Armbruster <arm...@redhat.com>
>
> Reviewed-by: Marc-André Lureau <marcandre.lur...@redhat.com>
>
>
>> ---
>>  qemu-options.hx | 15 +++
>>  1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index f112281d37..c647fdde62 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -1172,6 +1172,13 @@ STEXI
>>  Create synthetic file system image
>>  ETEXI
>>
>> +DEF("iscsi", HAS_ARG, QEMU_OPTION_iscsi,
>> +"-iscsi [user=user][,password=password]\n"
>> +"   [,header-digest=CRC32C|CR32C-NONE|NONE-CRC32C|NONE\n"
>> +"   [,initiator-name=initiator-iqn][,id=target-iqn]\n"
>> +"   [,timeout=timeout]\n"
>> +"iSCSI session parameters\n", QEMU_ARCH_ALL)
>> +
>>  STEXI
>>  @end table
>>  ETEXI
>> @@ -2811,14 +2818,6 @@ STEXI
>>  ETEXI
>>  DEFHEADING()
>>
>> -DEFHEADING(Device URL Syntax:)
>> -DEF("iscsi", HAS_ARG, QEMU_OPTION_iscsi,
>> -"-iscsi [user=user][,password=password]\n"
>> -"   [,header-digest=CRC32C|CR32C-NONE|NONE-CRC32C|NONE\n"
>> -"   [,initiator-name=initiator-iqn][,id=target-iqn]\n"
>> -"   [,timeout=timeout]\n"
>> -"iSCSI session parameters\n", QEMU_ARCH_ALL)
>> -
>>  DEFHEADING(Bluetooth(R) options:)
>>  STEXI
>>  @table @option
>> --
>> 2.13.6
>>
>>
>
>
>
> --
> Marc-André Lureau



Re: [Qemu-devel] [PATCH 5/8] qemu-options: Add missing -iscsi Texinfo documentation

2017-10-04 Thread ronnie sahlberg
Reviewed-by: Ronnie Sahlberg <ronniesahlb...@gmail.com>

On Tue, Oct 3, 2017 at 12:03 AM, Markus Armbruster <arm...@redhat.com> wrote:
> Cc: Ronnie Sahlberg <ronniesahlb...@gmail.com>
> Cc: Kevin Wolf <kw...@redhat.com>
> Cc: Max Reitz <mre...@redhat.com>
> Cc: qemu-bl...@nongnu.org
> Signed-off-by: Markus Armbruster <arm...@redhat.com>
> ---
>  qemu-options.hx | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/qemu-options.hx b/qemu-options.hx
> index c647fdde62..8568ee388c 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -1180,6 +1180,12 @@ DEF("iscsi", HAS_ARG, QEMU_OPTION_iscsi,
>  "iSCSI session parameters\n", QEMU_ARCH_ALL)
>
>  STEXI
> +@item -iscsi
> +@findex -iscsi
> +Configure iSCSI session parameters.
> +ETEXI
> +
> +STEXI
>  @end table
>  ETEXI
>  DEFHEADING()
> --
> 2.13.6
>



Re: [Qemu-devel] iSER-Libiscsi presentation from SDC2016

2016-09-21 Thread ronnie sahlberg
Congratulations on a fantastic presentation, Roy.

Very impressive data!

On Wed, Sep 21, 2016 at 11:39 AM, Roy Shterman  wrote:
> Hi all,
>
>
>
> The attached presentation was presented at SDC 2016 in Santa Clara and
> summarizes the iSER implementation inside Libiscsi/QEMU.
>
>
>
> Here you can find a performance comparison between different iSCSI
> initiator deployments - iSER gives very high performance when working
> with the Libiscsi library.
>
>
>
> I would love to hear comments/suggestions.
>
>
>
> Thanks,
>
> Roy
>
>



Re: [Qemu-devel] [PATCH] block/iscsi: Adding iser support in Libiscsi-QEMU

2016-08-22 Thread ronnie sahlberg
It is never too late.

I can start working on a patch to add "iser://" URL support to
libiscsi right now. It should be a trivial change.
I think I would prefer iser:// instead of iscsi+iser://, but I don't
feel strongly about it. Let me know if you would rather want iscsi+iser.


But you would still need some changes to Roy's patch to QEMU, right?
I.e. I think you would need an additional, new block driver:

static BlockDriver bdrv_iser = {
    .format_name   = "iser",
    .protocol_name = "iser",
    ...

No?



On Mon, Aug 1, 2016 at 6:50 AM, Paolo Bonzini  wrote:
>
>
> On 27/07/2016 12:02, Roy Shterman wrote:
>> iSER is a new transport layer supported in Libiscsi,
>> iSER provides a zero-copy RDMA capable interface that can
>> improve performance.
>>
>> A new API is introduced as an abstraction of the Libiscsi transport layer.
>> In order to use the new iSER transport, one needs to add the ?iser option
>> at the end of the Libiscsi URI.
>
> Hi, is it too late to use the URI scheme instead---for example
> iscsi+iser://.../... ?  In any case this should not affect the QEMU bits.
>
> Paolo
>
>> For now iSER memory buffers are pre-allocated and pre-registered,
>> hence in order to work with iSER from QEMU, one needs to set the MEMLOCK
>> attribute in the VM large enough for all iSER buffers and RDMA
>> resources.
>>
>> New functionality is also introduced in this commit: a new API
>> to deploy zero-copy command submission. iSER differs from TCP in the
>> data path, hence IO vectors must be transferred already when queueing
>> the PDU.
>>
>> Signed-off-by: Roy Shterman 
>> ---
> >>  block/iscsi.c |   45 +++++++++++++++++++++++++++++++++++++++++----
> >>  1 files changed, 41 insertions(+), 4 deletions(-)
>>
> >> diff --git a/block/iscsi.c b/block/iscsi.c
> >> index 7e78ade..6b95636 100644
> >> --- a/block/iscsi.c
> >> +++ b/block/iscsi.c
> >> @@ -41,6 +41,7 @@
> >>  #include "qapi/qmp/qstring.h"
> >>  #include "crypto/secret.h"
> >>
> >> +#include "qemu/uri.h"
> >>  #include <iscsi/iscsi.h>
> >>  #include <iscsi/scsi-lowlevel.h>
> >>
> >> @@ -484,6 +485,18 @@ iscsi_co_writev_flags(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
> >>      iscsi_co_init_iscsitask(iscsilun, &iTask);
> >>  retry:
> >>      if (iscsilun->use_16_for_rw) {
> >> +#if LIBISCSI_API_VERSION >= (20160603)
> >> +        iTask.task = iscsi_write16_iov_task(iscsilun->iscsi, iscsilun->lun, lba,
> >> +                                            NULL, num_sectors * iscsilun->block_size,
> >> +                                            iscsilun->block_size, 0, 0, fua, 0, 0,
> >> +                                            iscsi_co_generic_cb, &iTask,
> >> +                                            (struct scsi_iovec *)iov->iov, iov->niov);
> >> +    } else {
> >> +        iTask.task = iscsi_write10_iov_task(iscsilun->iscsi, iscsilun->lun, lba,
> >> +                                            NULL, num_sectors * iscsilun->block_size,
> >> +                                            iscsilun->block_size, 0, 0, fua, 0, 0,
> >> +                                            iscsi_co_generic_cb, &iTask,
> >> +                                            (struct scsi_iovec *)iov->iov, iov->niov);
> >> +    }
> >> +#else
> >>      iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
> >>                                      NULL, num_sectors * iscsilun->block_size,
> >>                                      iscsilun->block_size, 0, 0, fua, 0, 0,
> >> @@ -494,11 +507,14 @@ retry:
> >>                                      iscsilun->block_size, 0, 0, fua, 0, 0,
> >>                                      iscsi_co_generic_cb, &iTask);
> >>      }
> >> +#endif
> >>      if (iTask.task == NULL) {
> >>          return -ENOMEM;
> >>      }
> >> +#if LIBISCSI_API_VERSION < (20160603)
> >>      scsi_task_set_iov_out(iTask.task, (struct scsi_iovec *) iov->iov,
> >>                            iov->niov);
> >> +#endif
> >>      while (!iTask.complete) {
> >>          iscsi_set_events(iscsilun);
> >>          qemu_coroutine_yield();
> >> @@ -677,6 +693,19 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,
> >>      iscsi_co_init_iscsitask(iscsilun, &iTask);
> >>  retry:
> >>      if (iscsilun->use_16_for_rw) {
> >> +#if LIBISCSI_API_VERSION >= (20160603)
> >> +        iTask.task = iscsi_read16_iov_task(iscsilun->iscsi, iscsilun->lun, lba,
> >> +                                           num_sectors * iscsilun->block_size,
> >> +                                           iscsilun->block_size, 0, 0, 0, 0, 0,
> >> +                                           iscsi_co_generic_cb, &iTask,
> >> +                                           (struct scsi_iovec *)iov->iov, iov->niov);
> >> +    } else {
> >> +        iTask.task = iscsi_read10_iov_task(iscsilun->iscsi, iscsilun->lun, lba,
> >> +                                           num_sectors * iscsilun->block_size,
> >> +                                           iscsilun->block_size,
> >> +                                           0, 0, 0, 0, 0,
> >> +                                           iscsi_co_generic_cb, &iTask,
> >> +                                           (struct scsi_iovec *)iov->iov, iov->niov);
> >> +    }
> >> +#else
> >>      iTask.task 

Re: [Qemu-devel] [Qemu-block] [PATCH] send readcapacity10 when readcapacity16 failed

2016-01-05 Thread ronnie sahlberg
MMC devices:
READ CAPACITY 10 support is mandatory.
No support for READ CAPACITY 16

SBC devices:
READ CAPACITY 10 is mandatory
READ CAPACITY 16 support is only required when you have thin provisioning
or protection information (or if the device is >2^32 blocks)
Almost all, but apparently not all, SBC devices support both.


For SBC devices you probably want to start with RC16 and only fall back
to RC10 if you get INVALID_OPCODE.
You start with RC16 since this is the way to discover whether you have
thin provisioning or special protection information.

For MMC devices you could use the "try RC16 first and fall back to RC10"
approach too, but as probably almost no MMC devices support RC16 it
makes little sense to do so.
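
A minimal sketch of that probe order with the libiscsi sync API (error
handling trimmed; the sense constants are the ones QEMU's block/iscsi.c
checks for):

    struct scsi_task *task = iscsi_readcapacity16_sync(iscsi, lun);

    if (task != NULL && task->status == SCSI_STATUS_GOOD) {
        /* thin provisioning, protection info and >2TB size live here */
        struct scsi_readcapacity16 *rc16 = scsi_datain_unmarshall(task);
        (void)rc16;
    } else if (task != NULL &&
               task->status == SCSI_STATUS_CHECK_CONDITION &&
               task->sense.key == SCSI_SENSE_ILLEGAL_REQUEST &&
               task->sense.ascq == SCSI_SENSE_ASCQ_INVALID_OPERATION_CODE) {
        /* target does not implement RC16 at all: fall back to RC10 */
        scsi_free_scsi_task(task);
        task = iscsi_readcapacity10_sync(iscsi, lun, 0, 0);
    }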



On Tue, Jan 5, 2016 at 11:42 AM, John Snow  wrote:

>
>
> On 12/28/2015 10:32 PM, Zhu Lingshan wrote:
> > When playing with a Dell MD3000 target, for sure it
> > is a TYPE_DISK, but readcapacity16 would fail.
> > Then we found that readcapacity10 succeeded. It
> > looks like the target just supports readcapacity10
> > even though it is a TYPE_DISK, or has some
> > TYPE_ROM characteristics.
> >
> > This patch gives a chance to send
> > readcapacity10 when readcapacity16 has failed.
> > This patch is not harmful to the original paths.
> >
> > Signed-off-by: Zhu Lingshan 
> > ---
> >  block/iscsi.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/iscsi.c b/block/iscsi.c
> > index bd1f1bf..c8d167f 100644
> > --- a/block/iscsi.c
> > +++ b/block/iscsi.c
> > @@ -1243,8 +1243,9 @@ static void iscsi_readcapacity_sync(IscsiLun *iscsilun, Error **errp)
> >              iscsilun->lbprz = !!rc16->lbprz;
> >              iscsilun->use_16_for_rw = (rc16->returned_lba > 0xffffffff);
> >          }
> > +        break;
> >      }
> > -    break;
> > +    //fall through to try readcapacity10 instead
> >  case TYPE_ROM:
> >      task = iscsi_readcapacity10_sync(iscsilun->iscsi, iscsilun->lun, 0, 0);
> >      if (task != NULL && task->status == SCSI_STATUS_GOOD) {
> >
>
> For the uninitiated, why does readcapacity16 fail?
>
> My gut feeling is that this is a hack, because:
>
> - Either readcapacity16 should work, or
> - We shouldn't be choosing 10/16 based on the target type to begin with
>
> but I don't know much about iSCSI, so maybe you, Paolo or Peter could
> fill me in.
>
> --js
>


Re: [Qemu-devel] [PATCH] iscsi: Add chap and "initiator-name" etc as per drive options

2015-09-11 Thread ronnie sahlberg
On Fri, Sep 11, 2015 at 8:20 AM, Eric Blake  wrote:
> On 09/11/2015 12:00 AM, Fam Zheng wrote:
>> Previously we use "-iscsi id=target-iqn,user=foo,password=bar,..." to
>> specify iscsi connection parameters, unfortunately it doesn't work with
>> qemu-img.
>>
>> This patch adds per drive options to iscsi driver so that at least
>> qemu-img can use the "json:{...}" filename magic.
>>
>> Signed-off-by: Fam Zheng 
>> ---
>>  block/iscsi.c | 83 +++++++++++++++++++++++++++++++++++++++++----------
>>  1 file changed, 64 insertions(+), 19 deletions(-)
>
> It would be nice to also add a matching BlockdevOptionsIscsi to
> qapi/block-core.json, to allow setting these structured options from
> QMP.  Separate patch is fine, but we need to do the work for ALL of the
> remaining block devices eventually, and now that you are structuring the
> command line is a good time to think about it.
>
>
>>  static void iscsi_nop_timed_event(void *opaque)
>> @@ -1229,6 +1253,27 @@ static QemuOptsList runtime_opts = {
>>  .name = "filename",
>>  .type = QEMU_OPT_STRING,
>>  .help = "URL to the iscsi image",
>> +},{
>> +.name = "user",
>> +.type = QEMU_OPT_STRING,
>> +.help = "username for CHAP authentication to target",
>> +},{
>> +.name = "password",
>> +.type = QEMU_OPT_STRING,
>> +.help = "password for CHAP authentication to target",
>> +},{
>
> Also, this requires passing the password in the command line. We
> _really_ need to solve the problem of allowing the password to be passed
> via a fd or other QMP command, rather than on the command line.


Passing via command line is evil. It should still be possible to pass
all this via a config file to qemu:

"""
...
Howto use a configuration file to set iSCSI configuration options:
@example
cat >iscsi.conf <
> --
> Eric Blake   eblake redhat com+1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
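
A sketch of such a configuration file, with made-up credentials and the
option names from the -iscsi help text quoted elsewhere in this digest;
it would be passed to qemu with -readconfig iscsi.conf:

    [iscsi]
      user = "me"
      password = "my secret"
      initiator-name = "iqn.2008-11.org.linux-kvm:my-initiator"
      header-digest = "CRC32C"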



Re: [Qemu-devel] [Qemu-stable] Recent patches for 2.4

2015-08-04 Thread ronnie sahlberg
On Tue, Aug 4, 2015 at 5:53 AM, Peter Lieven p...@kamp.de wrote:

 On 04.08.2015 at 14:29, Peter Lieven wrote:

 On 04.08.2015 at 14:09, Paolo Bonzini wrote:


 On 04/08/2015 13:57, Peter Lieven wrote:

 Okay, what I found out is that in aio_poll I get revents = POLLIN for
 the nfs file descriptor. But there is no data available on the socket.

 Does read return 0 or EAGAIN?

 If it returns EAGAIN, the bug is in the QEMU main loop or the kernel.
 It should never happen that poll returns POLLIN and read returns EAGAIN.

 If it returns 0, it means the other side called shutdown(fd, SHUT_WR).
 Then I think the bug is in the libnfs driver or more likely libnfs.  You
 should stop polling the POLLIN event after read has returned 0 once.


 You might be right. Ronnie originally used the FIONREAD ioctl before
 every read and considered
 the socket as disconnected if the available bytes returned were 0.
 I found that I get available bytes == 0 from that ioctl even if the
 socket was not closed.


You should only get 0 from this call if there are no actual bytes available to read.

For context, the problem was that nfs_process_read() in block/nfs.c
(lines 75-80,
http://git.qemu.org/?p=qemu.git;a=blob;f=block/nfs.c;h=c026ff6883311f2a9b16a4326ad1da97c6c3d4f4;hb=2be4f242b50a84bf360df02480b173bfed161107#l75):

static void nfs_process_read(void *arg)
{
    NFSClient *client = arg;
    nfs_service(client->context, POLLIN);
    nfs_set_events(client);
}

sometimes triggers and calls nfs_service(POLLIN) even though the socket
is not readable.
I verified this by adding an extra call to poll() at around line 78
to check whether POLLIN was actually set on the fd or not. Sometimes it
would not be, but I got lost in the sources and could not find if or
where this happens, or even whether qemu guarantees to only call the
POLLIN callbacks if the filedescriptor is actually readable.


The old code in libnfs used to assume that IF we are called for POLLIN
and the ioctl(FIONREAD) returns that there are 0 bytes available to
read, then there is a problem with the socket.

:-(




 This seems to be some kind of bug in Linux - at least that is what I thought.

 See BUGS in the select(2) manpage:

    Under Linux, select() may report a socket file descriptor as ready
 for reading, while nevertheless a subsequent read blocks. This could
 for example happen when data has arrived but upon examination has a
 wrong checksum and is discarded. There may be other circumstances in
 which a file descriptor is spuriously reported as ready. Thus it may
 be safer to use O_NONBLOCK on sockets that should not block.

 I will debug further, but it seems that I receive a POLLIN even if
 there is no data available. I see 0 bytes from the recv call inside
 libnfs and continue without a deadlock - at least so far.

 Would it be a good idea to count the number of 0-byte returns from recv
 and react after I have received 0 bytes a number of consecutive times?

 And then: stop polling POLLIN, or reconnect?


 Okay, got it. Ronnie was using FIONREAD without checking for EAGAIN or
 EINTR.

 I will send a patch for libnfs to reconnect if count == 0. Libiscsi is not
 affected, it reconnects if count is 0.


Thanks, and merged.
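
For reference, the read-side pattern being settled on here is roughly
this sketch (not the literal libnfs code):

    ssize_t count = recv(fd, buf, sizeof(buf), 0);
    if (count < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) {
            /* spurious POLLIN: just wait for the next event */
        } else {
            /* real socket error: tear down and reconnect */
        }
    } else if (count == 0) {
        /* peer performed an orderly shutdown: reconnect */
    } else {
        /* feed the received bytes to the protocol state machine */
    }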




 Peter




Re: [Qemu-devel] [PATCH] block/iscsi: add support for request timeouts

2015-06-22 Thread ronnie sahlberg
LGTM

It is good to finally have timeouts that work in libiscsi, and a
consumer that can use and benefit from them.

On Tue, Jun 16, 2015 at 4:45 AM, Peter Lieven p...@kamp.de wrote:

 libiscsi starting with 1.15 will properly support timeout of iscsi
 commands. The default will remain no timeout, but this can
 be changed via cmdline parameters, e.g.:

 qemu -iscsi timeout=30 -drive file=iscsi://...

 If a timeout occurs a reconnect is scheduled and the timed out command
 will be requeued for processing after a successful reconnect.

 The required API call iscsi_set_timeout is present since libiscsi
 1.10 which was released in October 2013. However, due to some bugs
 in the libiscsi code the use is not recommended before version 1.15.

 Please note that this patch bumps the libiscsi requirement to 1.10
 to have all functions and macros defined. The patch also fixes an
 off-by-one error in the NOP timeout calculation, which was noticed
 while touching these code parts.

 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c   | 87 ++++++++++++++++++++++++++++++++++++++++---------
  configure       |  6 ++--
  qemu-options.hx |  4 +++
  3 files changed, 72 insertions(+), 25 deletions(-)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index 14e97a6..f19a56a 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -69,6 +69,7 @@ typedef struct IscsiLun {
      bool dpofua;
      bool has_write_same;
      bool force_next_flush;
 +    bool request_timed_out;
  } IscsiLun;

  typedef struct IscsiTask {
 @@ -99,7 +100,8 @@ typedef struct IscsiAIOCB {
  #endif
  } IscsiAIOCB;

 -#define EVENT_INTERVAL 250
 +/* libiscsi uses time_t so its enough to process events every second */
 +#define EVENT_INTERVAL 1000
  #define NOP_INTERVAL 5000
  #define MAX_NOP_FAILURES 3
  #define ISCSI_CMD_RETRIES ARRAY_SIZE(iscsi_retry_times)
 @@ -186,13 +188,18 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
              iTask->do_retry = 1;
              goto out;
          }
 -        /* status 0x28 is SCSI_TASK_SET_FULL. It was first introduced
 -         * in libiscsi 1.10.0. Hardcode this value here to avoid
 -         * the need to bump the libiscsi requirement to 1.10.0 */
 -        if (status == SCSI_STATUS_BUSY || status == 0x28) {
 +        if (status == SCSI_STATUS_BUSY || status == SCSI_STATUS_TIMEOUT ||
 +            status == SCSI_STATUS_TASK_SET_FULL) {
              unsigned retry_time =
                  exp_random(iscsi_retry_times[iTask->retries - 1]);
 -            error_report("iSCSI Busy/TaskSetFull (retry #%u in %u ms): %s",
 +            if (status == SCSI_STATUS_TIMEOUT) {
 +                /* make sure the request is rescheduled AFTER the
 +                 * reconnect is initiated */
 +                retry_time = EVENT_INTERVAL * 2;
 +                iTask->iscsilun->request_timed_out = true;
 +            }
 +            error_report("iSCSI Busy/TaskSetFull/TimeOut"
 +                         " (retry #%u in %u ms): %s",
                           iTask->retries, retry_time,
                           iscsi_get_error(iscsi));
              aio_timer_init(iTask->iscsilun->aio_context,
 @@ -276,20 +283,26 @@ iscsi_set_events(IscsiLun *iscsilun)
                             iscsilun);
          iscsilun->events = ev;
      }
 -
 -    /* newer versions of libiscsi may return zero events. In this
 -     * case start a timer to ensure we are able to return to service
 -     * once this situation changes. */
 -    if (!ev) {
 -        timer_mod(iscsilun->event_timer,
 -                  qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + EVENT_INTERVAL);
 -    }
  }

 -static void iscsi_timed_set_events(void *opaque)
 +static void iscsi_timed_check_events(void *opaque)
  {
      IscsiLun *iscsilun = opaque;
 +
 +    /* check for timed out requests */
 +    iscsi_service(iscsilun->iscsi, 0);
 +
 +    if (iscsilun->request_timed_out) {
 +        iscsilun->request_timed_out = false;
 +        iscsi_reconnect(iscsilun->iscsi);
 +    }
 +
 +    /* newer versions of libiscsi may return zero events. Ensure we are able
 +     * to return to service once this situation changes. */
      iscsi_set_events(iscsilun);
 +
 +    timer_mod(iscsilun->event_timer,
 +              qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + EVENT_INTERVAL);
  }

  static void
 @@ -1096,16 +1109,37 @@ static char *parse_initiator_name(const char *target)
      return iscsi_name;
  }

 +static int parse_timeout(const char *target)
 +{
 +    QemuOptsList *list;
 +    QemuOpts *opts;
 +    const char *timeout;
 +
 +    list = qemu_find_opts("iscsi");
 +    if (list) {
 +        opts = qemu_opts_find(list, target);
 +        if (!opts) {
 +            opts = QTAILQ_FIRST(&list->head);
 +        }
 +        if (opts) {
 +            timeout = qemu_opt_get(opts, "timeout");
 +            if (timeout) {
 +                return atoi(timeout);
 +            }
 +        }
 +    }

Re: [Qemu-devel] [PATCH] block/iscsi: do not forget to logout from target

2015-04-14 Thread ronnie sahlberg
Reviewed-By: Ronnie Sahlberg ronniesahlb...@gmail.com

On Tue, Apr 14, 2015 at 1:37 AM, Peter Lieven p...@kamp.de wrote:
 We actually were always impolitely dropping the connection and
 not cleanly logging out.

 Cc: qemu-sta...@nongnu.org
 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c | 6 ++++++
  1 file changed, 6 insertions(+)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index ab20e4d..0b6d3dd 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -1503,6 +1503,9 @@ out:

      if (ret) {
          if (iscsi != NULL) {
 +            if (iscsi_is_logged_in(iscsi)) {
 +                iscsi_logout_sync(iscsi);
 +            }
              iscsi_destroy_context(iscsi);
          }
          memset(iscsilun, 0, sizeof(IscsiLun));
 @@ -1516,6 +1519,9 @@ static void iscsi_close(BlockDriverState *bs)
      struct iscsi_context *iscsi = iscsilun->iscsi;

      iscsi_detach_aio_context(bs);
 +    if (iscsi_is_logged_in(iscsi)) {
 +        iscsi_logout_sync(iscsi);
 +    }
      iscsi_destroy_context(iscsi);
      g_free(iscsilun->zeroblock);
      g_free(iscsilun->allocationmap);
 --
 1.9.1




Re: [Qemu-devel] [Qemu-block] [PATCH] block/iscsi: handle zero events from iscsi_which_events

2015-04-08 Thread ronnie sahlberg
On Wed, Apr 8, 2015 at 1:38 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Tue, Apr 7, 2015 at 9:08 PM, Peter Lieven p...@kamp.de wrote:

 Please CC qemu-devel@nongnu.org in the future.  All patches must be on
 the qemu-devel mailing list so they can be merged (for transparency).
 I have added qemu-devel to CC.

 +    /* newer versions of libiscsi may return zero events. In this
 +     * case start a timer to ensure we are able to return to service
 +     * once this situation changes. */
 +    if (!ev) {
 +        timer_mod(iscsilun->event_timer,
 +                  qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + EVENT_INTERVAL);
 +    }

 This suggests that libiscsi is waiting on its own internal timer.  It
 would be cleaner for libiscsi to provide a timeout value so the
 application doesn't need to poll the library.  Is there still scope to
 modify libiscsi to do this?

I think that returning a timeout value would be a bigger change to the
API, which Peter wanted to avoid.
We discussed that as an option, but my take from it was that this would
require that qemu and libiscsi again would have to be updated in
lockstep.

In this particular case with creating a delay between failed reconnect
attempts, i.e. the target is restarting and
returning a RST to the SYN request until it has fully restarted, the
correct amount of time to wait until re-trying is at best a guess
anyway.
I don't have any particular feelings on whether the decision on how
long to wait is done inside libiscsi and communicated to qemu,
or if it is done in qemu itself.

The nice part with the current patch of Peter is that qemu and
libiscsi can be upgraded/downgraded independently.



Re: [Qemu-devel] [PATCH 2/4] block: immediately cancel oversized read/write requests

2014-09-08 Thread ronnie sahlberg
On Mon, Sep 8, 2014 at 7:42 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 08/09/2014 16:35, Peter Lieven wrote:

 messages. :)
 So you would not throw an error msg here?
 No, though a trace could be useful.

 Is there a howto somewhere on how to implement that?

 Try commit 4ac4458076e1aaf3b01a6361154117df20e22215.

 What's your opinion on changing the max_xfer_len to 0xffff regardless
 of use_16_for_rw in iSCSI?

 If you implemented request splitting in the block layer, it would be
 okay to force max_xfer_len to 0xffff.

I think it should be OK if that is a different series. This only
affects the iSCSI transport since it is the only transport so far to
record or report a max transfer length.
If a guest generates these big requests (i.e. not multiwrite) we
already fail them due to the READ10 limitations. We would just fail
the request at a different layer in the stack with these patches.


What I would also like to see is reporting these limitations to the
guest itself, to prevent it from generating too large I/Os.
I am willing to do that part once the initial max_xfer_len ones are finished.



Re: [Qemu-devel] [PATCH 0/4] introduce max_transfer_length

2014-09-05 Thread ronnie sahlberg
Looks good to me.

(minor question is just why not let the default max be 0xffff for both
10 and 16 byte CDBs?)

On Fri, Sep 5, 2014 at 9:51 AM, Peter Lieven p...@kamp.de wrote:
 This series adds the basics for introducing a maximum transfer length
 to the block layer. Its main purpose is currently avoiding that
 a multiwrite_merge exceeds the max_xfer_len of an attached iSCSI LUN.
 This is a required bug fix.

 Discussed reporting of this maximum in the SCSI Disk Inquiry Emulation
 of the Block Limits VPD is currently not added, as we do not import any
 of the other limits there. This has to be addressed in a separate series.

 Peter Lieven (4):
   BlockLimits: introduce max_transfer_length
   block: immediate cancel oversized read/write requests
   block/iscsi: set max_transfer_length
   block: avoid creating oversized writes in multiwrite_merge

  block.c                   |   23 +++++++++++++++++++
  block/iscsi.c             |   12 ++++++++++--
  include/block/block_int.h |    3 +++
  3 files changed, 36 insertions(+), 2 deletions(-)

 --
 1.7.9.5




Re: [Qemu-devel] [PATCH 0/4] introduce max_transfer_length

2014-09-05 Thread ronnie sahlberg
Feel free to add a

Reviewed-by: Ronnie Sahlberg ronniesahlb...@gmail.com

to the patches.

On Fri, Sep 5, 2014 at 12:52 PM, Peter Lieven p...@kamp.de wrote:
 On 05.09.2014 at 19:05, ronnie sahlberg wrote:
 Looks good to me.

 (minor question is just why not let the default max be 0xffff for both
 10 and 16 byte CDBs?)

 You are right. I was looking at the technical limit, but in fact it
 doesn't make sense to have different limits. It's ridiculous to say:
 you want to do big I/O, then you need a target that's bigger than 2TB ;-)

 Peter


 On Fri, Sep 5, 2014 at 9:51 AM, Peter Lieven p...@kamp.de wrote:
 This series adds the basics for introducing a maximum transfer length
 to the block layer. Its main purpose is currently avoiding that
 a multiwrite_merge exceeds the max_xfer_len of an attached iSCSI LUN.
 This is a required bug fix.

  Discussed reporting of this maximum in the SCSI Disk Inquiry Emulation
  of the Block Limits VPD is currently not added, as we do not import any
  of the other limits there. This has to be addressed in a separate series.

 Peter Lieven (4):
   BlockLimits: introduce max_transfer_length
   block: immediate cancel oversized read/write requests
   block/iscsi: set max_transfer_length
   block: avoid creating oversized writes in multiwrite_merge

   block.c                   |   23 +++++++++++++++++++
   block/iscsi.c             |   12 ++++++++++--
   include/block/block_int.h |    3 +++
   3 files changed, 36 insertions(+), 2 deletions(-)

 --
 1.7.9.5





Re: [Qemu-devel] [PATCH] block/iscsi: use 16 byte CDBs only when necessary

2014-09-03 Thread ronnie sahlberg
On Wed, Sep 3, 2014 at 1:09 AM, Peter Lieven p...@kamp.de wrote:


 On 02.09.2014 at 21:30, Peter Lieven p...@kamp.de wrote:

 Looking at the code, is it possible that it is not the guest causing
 trouble here, but the multiwrite_merge code?

 From what I see, the only limit it has when merging requests is the
 number of IOVs.


 Any thoughts?

 Mine are:
 a) Introducing bs->bl.max_request_size and setting merge = 0 if the
 result would be too big. Default max request size to 32768 sectors
 (see below).
 b) Hardcoding the limit in multiwrite_merge for now, limiting the merged
 size to 16MB (32768 sectors), which is the limit we already use in
 bdrv_co_discard and bdrv_co_write_zeroes if we don't know better.

 or c) disabling multiwrite merge for RAW or only iSCSI completely.


I think (a) would be best.
But I would suggest some small modifications:

Set the default max to something even smaller, like 256 sectors. This
should not have much effect on throughput since the client/initiator
can just send several concurrent i/os instead of one large i/o.

Also update scsi_disk_emulate_inquiry() so we tell the guest the
maximum transfer length for a single i/o
so that the guest kernel will break up any huge requests into proper sizes.
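
A sketch of advertising that limit in the emulated Block Limits VPD
page (page 0xb0; per SBC the maximum transfer length, in blocks, sits
in bytes 8..11 of the page - 'outbuf' and the chosen value are
assumptions here, not QEMU's actual code):

    /* INQUIRY EVPD page 0xb0: Block Limits */
    uint32_t max_transfer = 0xffff;     /* e.g. the READ10 limit, in blocks */
    outbuf[8]  = (max_transfer >> 24) & 0xff;
    outbuf[9]  = (max_transfer >> 16) & 0xff;
    outbuf[10] = (max_transfer >> 8)  & 0xff;
    outbuf[11] = max_transfer & 0xff;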



Re: [Qemu-devel] [PATCH] block/iscsi: use 16 byte CDBs only when necessary

2014-09-03 Thread ronnie sahlberg
On Wed, Sep 3, 2014 at 7:18 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 03/09/2014 16:17, ronnie sahlberg wrote:
 I think (a) would be best.
 But I would suggest some small modifications:

 Set the default max to something even smaller, like 256 sectors. This
 should not have much effect on throughput since the client/initiator
 can just send several concurrent i/os instead of one large i/o.

 I disagree.  256 sectors are 128k, that's a fairly normal I/O size.


Fair enough.
But maybe a command line argument to set/override the max?

This would be useful when using scsi-passthrough to usb-scsi disks.
The Linux kernel has, I think, a hard limit of 120k for the usb
transport, which means that when doing SG_IO to a usb disk you are
limited to this as the max transfer length, since anything larger will
break in the usb layer.
This limit is also, I think, not easily discoverable by an application,
since
* the actual device still reports, in most cases, max_transfer_length
== unlimited
* and it is semi-hard to discover whether a /dev/sg* device is on a
usb bus or not.
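
(One possible heuristic, an assumption of mine rather than something
from this thread: follow the sysfs device link and look for a usb hop,
e.g.

    readlink -f /sys/class/scsi_generic/sg1/device | grep -q usb

which relies on the kernel exposing the transport in the device path.)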



Re: [Qemu-devel] [PATCH] block/iscsi: use 16 byte CDBs only when necessary

2014-09-02 Thread ronnie sahlberg
That is one big request.  I assume the device reports "no limit" in
the VPD page, so we can not state that it is the guest/application
going beyond the allowed limit?


I am not entirely sure what meaning the target assigns to "Protocol
Error" here.
It could be that ~100M is way higher than the MaxBurstLength?  What is
the MaxBurstLength that was reported by the server during login
negotiation?
If so, we should make libiscsi check the maxburstlength and fail the
request early. We would still fail the I/O so it will not really solve
anything much
but at least we should not send the request to the server.

Best would probably be to take the smallest of a non-zero
Block-Limits.max_transfer_length and the iSCSI MaxBurstLength divided
by the block size, and pass this back to the guest in the emulated
Block Limits VPD.
At least then you have tried to tell the guest to never do SCSI I/O
bigger than this.

I.e. even if the target reports BlockLimits.MaxTransferLength == 0 ==
no limit to QEMU, QEMU should probably take the iscsi transport limit
into account and pass it to the guest by scaling the emulated
BlockLimits page to the maximum that MaxBurstLength allows.
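
A sketch of that clamping; iscsi_get_max_burst_length() is a
hypothetical stand-in for however the negotiated value would actually
be fetched:

    uint32_t limit = bs->bl.max_transfer_length;          /* 0 == unlimited */
    uint32_t burst = iscsi_get_max_burst_length(iscsi) / block_size;

    if (burst && (limit == 0 || burst < limit)) {
        limit = burst;      /* the transport is the tighter constraint */
    }
    /* 'limit' then goes into the emulated Block Limits VPD page */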


Then if BTRFS or SG_IO in the guest ignores the BlockLimits it is
clearly a guest problem.

(A different interpretation for ProtocolError could be the mismatch
between the iscsi expected data transfer length and the scsi transfer
length, but that should result in residuals, not protocol error.)



Hypothetically there could be targets that support really huge
MaxBurstLengths > 32MB. For those you probably want to switch to
WRITE16 when the SCSI transfer length goes > 0xffff:

- if (iscsilun->use_16_for_rw)  {
+ if (iscsilun->use_16_for_rw || num_sectors > 0xffff)  {


regards
ronnie sahlberg

On Mon, Sep 1, 2014 at 8:21 AM, Peter Lieven p...@kamp.de wrote:
 On 17.06.2014 13:46, Paolo Bonzini wrote:

 On 17/06/2014 13:37, Peter Lieven wrote:

 On 17.06.2014 13:15, Paolo Bonzini wrote:

 On 17/06/2014 08:14, Peter Lieven wrote:



  BTW, while debugging a case with a bigger storage supplier I found
  that open-iscsi seems to do exactly this non-deterministic behaviour.
  I have a 3TB LUN. If I access < 2TB sectors it uses READ10/WRITE10 and
  if I go beyond 2TB it changes to READ16/WRITE16.


 Isn't that exactly what your latest patch does for 64K sector writes? :)


  Not exactly, we choose the default by checking the LUN size. 10 Byte
  for < 2TB and 16 Byte otherwise.


 Yeah, I meant introducing the non-determinism.

  My latest patch makes an exception if a request is bigger than 64K
  sectors and switches to 16 Byte requests. These would otherwise end
  in an I/O error.


  It could also be split at the block layer, like we do for unmap.  I
  think there's also a maximum transfer size somewhere in the VPD; we
  could do READ16/WRITE16 if it is > 64K sectors.


 It seems that there might be a real world example where Linux issues 32MB
 write requests. Maybe someone familiar with btrfs can advise.
 I see iSCSI Protocol Errors in my logs:

 Sep  1 10:10:14 libiscsi:0 PDU header: 01 a1 00 00 00 01 00 00 00 00 00 00
 00 00 00 00 00 00 00 07 06 8f 30 00 00 00 00 06 00 00 00 0a 2a 00 01 09 9e
 50 00 47 98 00 00 00 00 00 00 00 [XXX]
 Sep  1 10:10:14 qemu-2.0.0: iSCSI: Failed to write10 data to iSCSI lun.
 Request was rejected with reason: 0x04 (Protocol Error)

 Looking at the headers the xferlen in the iSCSI PDU is 110047232 Byte which
 is 214936 sectors.
 214936 % 65536 = 18328 which is exactly the number of blocks in the SCSI
 WRITE10 CDB.

 Can someone advise if this is something that btrfs can cause
 or if I have to
 blame the customer that he issues very big write requests with Direct I/O?

 The user sees something like this in the log:
 [34640.489284] BTRFS: bdev /dev/vda2 errs: wr 8232, rd 0, flush 0, corrupt
 0, gen 0
 [34640.490379] end_request: I/O error, dev vda, sector 17446880
 [34640.491251] end_request: I/O error, dev vda, sector 5150144
 [34640.491290] end_request: I/O error, dev vda, sector 17472080
 [34640.492201] end_request: I/O error, dev vda, sector 17523488
 [34640.492201] end_request: I/O error, dev vda, sector 17536592
 [34640.492201] end_request: I/O error, dev vda, sector 17599088
 [34640.492201] end_request: I/O error, dev vda, sector 17601104
 [34640.685611] end_request: I/O error, dev vda, sector 15495456
 [34640.685650] end_request: I/O error, dev vda, sector 7138216

 Thanks,
 Peter




Re: [Qemu-devel] [RFC] How to handle feature regressions in new QEMU releases

2014-07-23 Thread ronnie sahlberg
On Wed, Jul 16, 2014 at 10:29 AM, Michael Tokarev m...@tls.msk.ru wrote:
 16.07.2014 21:23, ronnie sahlberg wrote:

  If you ask debian to upgrade, could you ask them to wait and upgrade
  after I have released the next version, hopefully (if all goes well)
  at the end of this week?

 There's no problem in updating now to fix missing .pc file and to update
 next week to include a new version.


Please find a new version 1.12 on the website.

Thanks.
ronnie sahlberg



Re: [Qemu-devel] [PULL 00/14] SCSI changes for 2014-06-18

2014-06-18 Thread ronnie sahlberg
On Wed, Jun 18, 2014 at 8:57 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 18/06/2014 17:53, Peter Maydell wrote:

 Then you probably need to fix the configure test ;-) This is
 Ubuntu Trusty so really pretty recent (it's my main x86 build
 machine, not some oddball platform); libiscsi-dev 1.4.0-3.


 This is a very old release.  I'll apply Peter's patch to disable libiscsi
 for pre-1.9.0 versions and respin the pull request.

 I didn't want to do that because 1.4.0 -> 1.9.0 breaks ABI, but Peter
 thinks that versions before 1.8.0 are not reliable anyway.

+1

Versions prior to that are good for casual use, but Peter is the
authority on production use here.
I think qemu should have a configure check and just force-disable
iscsi support if libiscsi < 1.9.0.



 If Ubuntu wants to distribute both 1.4.0 and 1.9.0, they can move 1.9.0
 header files to a separate directory using pkgconfig (1.4.0 does not support
 pkgconfig).

 Paolo




Re: [Qemu-devel] [PATCH] block/iscsi: use 16 byte CDBs only when necessary

2014-06-04 Thread ronnie sahlberg
Looks good.

As an alternative, you could do the 10 vs 16 decision based on the LBA
instead of the size of the device:

-    if (use_16_for_ws) {
+    if (lba >= 0x100000000) {
         iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
                                             iscsilun->zeroblock,
                                             iscsilun->block_size,
                                             nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
                                             0, 0, iscsi_co_generic_cb, &iTask);
     } else {
         iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
                                             iscsilun->zeroblock,
                                             iscsilun->block_size,
                                             nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
                                             0, 0, iscsi_co_generic_cb, &iTask);
     }

That would mean you get to use the 10 version of the cdb even for very
large devices (as long as the IO is for blocks at the beginning of the
device) and thus provide partial avoidance of this issue for those
large devices.


ronnie sahlberg


On Wed, Jun 4, 2014 at 6:47 AM, Peter Lieven p...@kamp.de wrote:
 this patch changes the driver to use 16 Byte CDBs for
 READ/WRITE only if the target requires 64bit lba addressing.

 On one hand this saves 6 bytes in each PDU; on the other
 hand it seems that 10 Byte CDBs are much better
 supported and tested, as a recent issue I had with a
 major storage supplier pointed out.

 For WRITESAME the logic is a bit more tricky, as WRITESAME10
 with UNMAP was added really late. Thus a fallback to WRITESAME16
 is possible if the target supports UNMAP with it but not with WRITESAME10.

 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c |   58 +++++++++++++++++++++++++++++++++++---------------
  1 file changed, 40 insertions(+), 18 deletions(-)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index d241e83..019b324 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -65,6 +65,7 @@ typedef struct IscsiLun {
      unsigned char *zeroblock;
      unsigned long *allocationmap;
      int cluster_sectors;
 +    bool use_16_for_rw;
  } IscsiLun;

  typedef struct IscsiTask {
 @@ -368,10 +369,17 @@ static int coroutine_fn iscsi_co_writev(BlockDriverState *bs,
      num_sectors = sector_qemu2lun(nb_sectors, iscsilun);
      iscsi_co_init_iscsitask(iscsilun, &iTask);
  retry:
 -    iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
 -                                    data, num_sectors * iscsilun->block_size,
 -                                    iscsilun->block_size, 0, 0, 0, 0, 0,
 -                                    iscsi_co_generic_cb, &iTask);
 +    if (iscsilun->use_16_for_rw) {
 +        iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
 +                                        data, num_sectors * iscsilun->block_size,
 +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
 +                                        iscsi_co_generic_cb, &iTask);
 +    } else {
 +        iTask.task = iscsi_write10_task(iscsilun->iscsi, iscsilun->lun, lba,
 +                                        data, num_sectors * iscsilun->block_size,
 +                                        iscsilun->block_size, 0, 0, 0, 0, 0,
 +                                        iscsi_co_generic_cb, &iTask);
 +    }
      if (iTask.task == NULL) {
          g_free(buf);
          return -ENOMEM;
 @@ -545,20 +553,17 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,

      iscsi_co_init_iscsitask(iscsilun, &iTask);
  retry:
 -    switch (iscsilun->type) {
 -    case TYPE_DISK:
 +    if (iscsilun->use_16_for_rw) {
          iTask.task = iscsi_read16_task(iscsilun->iscsi, iscsilun->lun, lba,
                                         num_sectors * iscsilun->block_size,
                                         iscsilun->block_size, 0, 0, 0, 0, 0,
                                         iscsi_co_generic_cb, &iTask);
 -        break;
 -    default:
 +    } else {
          iTask.task = iscsi_read10_task(iscsilun->iscsi, iscsilun->lun, lba,
                                         num_sectors * iscsilun->block_size,
                                         iscsilun->block_size,
                                         0, 0, 0, 0, 0,
                                         iscsi_co_generic_cb, &iTask);
 -        break;
      }
      if (iTask.task == NULL) {
          return -ENOMEM;
 @@ -864,19 +869,27 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
      struct IscsiTask iTask;
      uint64_t lba;
      uint32_t nb_blocks;
 +    bool use_16_for_ws = iscsilun->use_16_for_rw;

      if (!is_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
          return -EINVAL;
      }

 -    if ((flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->lbp.lbpws) {
 -        /* WRITE SAME with UNMAP is not supported by the target,
 -         * fall back and try WRITE SAME without UNMAP */
 -        flags &= 

Re: [Qemu-devel] [PATCH] block/iscsi: use 16 byte CDBs only when necessary

2014-06-04 Thread ronnie sahlberg
On Wed, Jun 4, 2014 at 7:43 AM, Peter Lieven p...@kamp.de wrote:
 On 04.06.2014 16:00, ronnie sahlberg wrote:
 Looks good.

 As an alternative, you could do the 10 vs 16 decision based on the LBA
 instead of the size of the device:

 -    if (use_16_for_ws) {
 +    if (lba >= 0x100000000) {
          iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
                                              iscsilun->zeroblock,
                                              iscsilun->block_size,
                                              nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
                                              0, 0, iscsi_co_generic_cb, &iTask);
      } else {
          iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
                                              iscsilun->zeroblock,
                                              iscsilun->block_size,
                                              nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
                                              0, 0, iscsi_co_generic_cb, &iTask);
      }

 That would mean you get to use the 10 version of the cdb even for very
 large devices (as long as the IO is for blocks at the beginning of the
 device) and thus provide partial avoidance of this issue for those
 large devices.

 I like that idea, however I fear that this would introduce additional bugs.

Fair enough. Basing it on device size is fine.

 - Using 10 Byte CDBs where the target might expect 16 Byte CDBs?!

You can still use 10 byte commands, even on very large devices. You
just can not access the whole device using the 10 byte commands.
(I recall some of the legacy unixen used to do this in the old days.
Using READ6 for low LBAs and then switching to READ10 when the LBA got
too big for READ6.)


 - What if lba + num_blocks > 2^32-1 ?

That is fine. As long as lba <= 2^32-1


This means that with 10 you can actually read/write 65534 blocks
beyond block 2^32 :-)
LBA==0xffffffff and NumBlocks=0xffff
will read blocks 0xffffffff all the way to 0x10000fffd



 The switch I added is like Linux does it - as Paolo pointed out earlier.

 In my case the number of >2TB Targets is not that big so I can work with
 the switch based on the capacity. Until the bug is fixed I just can't move
 those >2TB volumes around on the storage array.

 Peter



 ronnie sahlberg


 On Wed, Jun 4, 2014 at 6:47 AM, Peter Lieven p...@kamp.de wrote:
 this patch changes the driver to use 16 Byte CDBs for
 READ/WRITE only if the target requires 64bit lba addressing.

 On one hand this saves 6 bytes in each PDU, on the other
 hand 10 Byte CDBs seem to be much better
 supported and tested, as a recent issue I had with a
 major storage supplier showed.

 For WRITESAME the logic is a bit more tricky as WRITESAME10
 with UNMAP was added really late. Thus a fallback to WRITESAME16
 is possible if the target supports UNMAP but not WRITESAME10.

 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c |   58 ++++++++++++++++++++++++++++++++++++++++------------------
  1 file changed, 40 insertions(+), 18 deletions(-)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index d241e83..019b324 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -65,6 +65,7 @@ typedef struct IscsiLun {
  unsigned char *zeroblock;
  unsigned long *allocationmap;
  int cluster_sectors;
 +bool use_16_for_rw;
  } IscsiLun;

  typedef struct IscsiTask {
 @@ -368,10 +369,17 @@ static int coroutine_fn iscsi_co_writev(BlockDriverState *bs,
  num_sectors = sector_qemu2lun(nb_sectors, iscsilun);
  iscsi_co_init_iscsitask(iscsilun, &iTask);
  retry:
 -iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
 -data, num_sectors * iscsilun->block_size,
 -iscsilun->block_size, 0, 0, 0, 0, 0,
 -iscsi_co_generic_cb, &iTask);
 +if (iscsilun->use_16_for_rw) {
 +iTask.task = iscsi_write16_task(iscsilun->iscsi, iscsilun->lun, lba,
 +data, num_sectors * iscsilun->block_size,
 +iscsilun->block_size, 0, 0, 0, 0, 0,
 +iscsi_co_generic_cb, &iTask);
 +} else {
 +iTask.task = iscsi_write10_task(iscsilun->iscsi, iscsilun->lun, lba,
 +data, num_sectors * iscsilun->block_size,
 +iscsilun->block_size, 0, 0, 0, 0, 0,
 +iscsi_co_generic_cb, &iTask);
 +}
  if (iTask.task == NULL) {
  g_free(buf);
  return -ENOMEM;
 @@ -545,20 +553,17 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,

  iscsi_co_init_iscsitask(iscsilun, &iTask);
  retry:
 -switch (iscsilun->type) {
 -case TYPE_DISK:
 +if (iscsilun->use_16_for_rw) {
  iTask.task = iscsi_read16_task(iscsilun->iscsi, iscsilun->lun, lba,
 num_sectors

Re: [Qemu-devel] [PATCH 08/20] nfs: Handle failure for potentially large allocations

2014-05-22 Thread ronnie sahlberg
For this case and for the iscsi case, isn't it likely that once this
happens the guest is pretty much doomed since I/O to the disk will no
longer work ?

Should we also change the block layer so that IF *_readv/_writev fails
with -ENOMEM
then it should try again but break the request up into a chain of
smaller chunks ?
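
For illustration, a minimal sketch of what such a fallback could look
like in the block layer (bdrv_co_writev_split() is a hypothetical
helper, not existing QEMU code):

static int coroutine_fn bdrv_co_writev_split(BlockDriverState *bs,
                                             int64_t sector_num,
                                             int nb_sectors,
                                             QEMUIOVector *qiov)
{
    int chunk = nb_sectors;
    int done = 0;

    while (done < nb_sectors) {
        int n = MIN(chunk, nb_sectors - done);
        QEMUIOVector part;
        int ret;

        /* carve the next n sectors out of the caller's iovec */
        qemu_iovec_init(&part, qiov->niov);
        qemu_iovec_concat(&part, qiov, done * BDRV_SECTOR_SIZE,
                          n * BDRV_SECTOR_SIZE);
        ret = bdrv_co_writev(bs, sector_num + done, n, &part);
        qemu_iovec_destroy(&part);

        if (ret == -ENOMEM && chunk > 1) {
            chunk /= 2;   /* halve the chunk size and retry this piece */
            continue;
        }
        if (ret < 0) {
            return ret;
        }
        done += n;
    }
    return 0;
}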



On Wed, May 21, 2014 at 9:28 AM, Kevin Wolf kw...@redhat.com wrote:
 Some code in the block layer makes potentially huge allocations. Failure
 is not completely unexpected there, so avoid aborting qemu and handle
 out-of-memory situations gracefully.

 This patch addresses the allocations in the nfs block driver.

 Signed-off-by: Kevin Wolf kw...@redhat.com
 ---
  block/nfs.c | 6 +++++-
  1 file changed, 5 insertions(+), 1 deletion(-)

 diff --git a/block/nfs.c b/block/nfs.c
 index 539bd95..e3d6216 100644
 --- a/block/nfs.c
 +++ b/block/nfs.c
 @@ -165,7 +165,11 @@ static int coroutine_fn nfs_co_writev(BlockDriverState *bs,

  nfs_co_init_task(client, task);

 -buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
 +buf = g_try_malloc(nb_sectors * BDRV_SECTOR_SIZE);
 +if (buf == NULL) {
 +return -ENOMEM;
 +}
 +
  qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);

  if (nfs_pwrite_async(client->context, client->fh,
 --
 1.8.3.1





Re: [Qemu-devel] [PATCH 08/22] iscsi: implement .bdrv_detach/attach_aio_context()

2014-05-08 Thread ronnie sahlberg
On Thu, May 8, 2014 at 4:33 AM, Stefan Hajnoczi stefa...@redhat.com wrote:
 On Wed, May 07, 2014 at 04:09:27PM +0200, Peter Lieven wrote:
 On 07.05.2014 12:29, Paolo Bonzini wrote:
 On 07/05/2014 12:07, Stefan Hajnoczi wrote:
 On Fri, May 02, 2014 at 12:39:06AM +0200, Peter Lieven wrote:
 +static void iscsi_attach_aio_context(BlockDriverState *bs,
 + AioContext *new_context)
 +{
 +IscsiLun *iscsilun = bs->opaque;
 +
 +iscsilun->aio_context = new_context;
 +iscsi_set_events(iscsilun);
 +
 +#if defined(LIBISCSI_FEATURE_NOP_COUNTER)
 +/* Set up a timer for sending out iSCSI NOPs */
 +iscsilun->nop_timer = aio_timer_new(iscsilun->aio_context,
 + QEMU_CLOCK_REALTIME, SCALE_MS,
 + iscsi_nop_timed_event, iscsilun);
 +timer_mod(iscsilun->nop_timer,
 +  qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + NOP_INTERVAL);
 +#endif
 +}
 
 Is it still guaranteed that iscsi_nop_timed_event for a target is not 
 invoked
 while we are in another function/callback of the iscsi driver for the 
 same target?
 
 Yes, since the timer is in the same AioContext as the iscsi driver 
 callbacks.


 Ok. Stefan: What MUST NOT happen is that the timer gets fired while we are 
 in iscsi_service.
 As Paolo outlined, this cannot happen, right?

 Okay, I think we're safe then.  The timer can only be invoked during
 aio_poll() event loop iterations.  It cannot be invoked while we're
 inside iscsi_service().

 BTW, is iscsi_reconnect() the right libiscsi interface to use since it
 is synchronous?  It seems like this would block QEMU until the socket
 has connected!  The guest would be frozen.
 
 There is no asynchronous interface yet for reconnection, unfortunately.

 We initiate the reconnect after we miss a few NOP replies. So the target is 
 already down for approx. 30 seconds.
 Every process inside the guest is already hanging or has timed out.

 If I understand correctly with the new patches only the communication with 
 this target is hanging or isn't it?
 So what benefit would an asyncronous reconnect have?

 Asynchronous reconnect is desirable:

 1. The QEMU monitor is blocked while we're waiting for the iSCSI target
to accept our reconnect.  This means the management stack (libvirt)
cannot control QEMU until we time out or succeed.

 2. The guest is totally frozen - cannot execute instructions - because
it will soon reach a point in the code that locks the QEMU global
mutex (which is being held while we reconnect to the iSCSI target).

This may be okayish for guests where the iSCSI LUN contains the
main data that is being processed.  But what if an iSCSI LUN was
just attached to a guest that is also doing other things that are
independent (e.g. serving a website, processing data from a local
disk, etc) - now the reconnect causes downtime for the entire guest.

I will look into making the reconnect async over the next few days.



Re: [Qemu-devel] [PATCHv6 0/6] block: add native support for NFS

2014-01-28 Thread ronnie sahlberg
Please try to sync libnfs again.   Autotools is still a bit of a mystery for me.

On Tue, Jan 28, 2014 at 2:16 AM, Stefan Hajnoczi stefa...@redhat.com wrote:
 On Mon, Jan 27, 2014 at 07:28:07PM +0100, Peter Lieven wrote:
  On 27.01.2014 17:11, Stefan Hajnoczi wrote:
  On Mon, Jan 13, 2014 at 11:21:52AM +0100, Peter Lieven wrote:
  This adds v6 of the NFS protocol driver + qemu-iotest adjustments.
 
   v5->v6:
   - use internal qemu function to parse the NFS url [Kevin]
   - zero pad short reads [Kevin, Paolo]
   - added qemu-iotests patches for basic nfs protocol support
 
   v4->v5:
   - discussed with Ronnie and decided to move URL + Parameter parsing to 
  LibNFS.
 This allows for URL parameter processing directly in LibNFS without 
  altering
 the qemu NFS block driver. This bumps the version requirement for 
  LibNFS
 to 1.9.0 though.
   - added a pointer to the LibNFS readme where additional information about
 ROOT privilege requirements can be found as this raised a few 
  concerns.
   - removed a trailing dot in an error statement [Fam].
 
   v3->v4:
   - finally added full implementation of bdrv_get_allocated_file_size 
  [Stefan]
   - removed trailing \n from error statements [Stefan]
 
   v2->v3:
   - rebased the stefanha/block
   - use pkg_config to check for libnfs (ignoring cflags which are broken 
  in 1.8.0) [Stefan]
   - fixed NFSClient declaration [Stefan]
   - renamed Task variables to task [Stefan]
   - renamed NFSTask to NFSRPC [Ronnie]
   - do not update bs-total_sectors in nfs_co_writev [Stefan]
   - return -ENOMEM on all async call failures [Stefan,Ronnie]
   - fully implement ftruncate
   - use util/uri.c for URL parsing [Stefan]
   - reworked nfs_file_open_common to nfs_client_open which works on 
  NFSClient [Stefan]
   - added a comment ot the connect message that libnfs support NFSv3 only 
  at the moment.
   - DID NOT add full implementation of bdrv_get_allocated_file_size because
 we are not in a coroutine context and I cannot do an async call here.
 I could do a sync call if there would be a guarantee that no requests
 are in flight. [Stefan]
 
   v1->v2:
   - fixed block/Makefile.objs [Ronnie]
   - do not always register a read handler [Ronnie]
   - add support for reading beyond EOF [Fam]
   - fixed struct and paramter naming [Fam]
   - fixed overlong lines and whitespace errors [Fam]
   - return the return status from libnfs wherever possible [Fam]
   - added comment why we set allocated_file_size to -ENOTSUP after write 
  [Fam]
   - avoid segfault when parsing filname [Fam]
   - remove unused close_bh from NFSClient [Fam]
   - avoid dividing and multiplying total_size by BDRV_SECTOR_SIZE in 
  nfs_file_create [Fam]
 
  Peter Lieven (6):
block: add native support for NFS
qemu-iotests: change _supported_proto to file for various tests
qemu-iotests: enable support for NFS protocol
qemu-iotests: enable test 016 and 025 to work with NFS protocol
qemu-iotests: fix expected output of test 067
qemu-iotests: blacklist test 020 for NFS protocol
 
   MAINTAINERS  |5 +
   block/Makefile.objs  |1 +
   block/nfs.c  |  444 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   configure|   26 +++
   qapi-schema.json |1 +
   tests/qemu-iotests/013   |2 +-
   tests/qemu-iotests/014   |2 +-
   tests/qemu-iotests/016   |2 +-
   tests/qemu-iotests/018   |2 +-
   tests/qemu-iotests/019   |2 +-
   tests/qemu-iotests/020   |7 +-
   tests/qemu-iotests/023   |2 +-
   tests/qemu-iotests/024   |2 +-
   tests/qemu-iotests/025   |2 +-
   tests/qemu-iotests/026   |2 +-
   tests/qemu-iotests/028   |2 +-
   tests/qemu-iotests/031   |2 +-
   tests/qemu-iotests/034   |2 +-
   tests/qemu-iotests/036   |2 +-
   tests/qemu-iotests/037   |2 +-
   tests/qemu-iotests/038   |2 +-
   tests/qemu-iotests/039   |2 +-
   tests/qemu-iotests/043   |2 +-
   tests/qemu-iotests/046   |2 +-
   tests/qemu-iotests/052   |2 +-
   tests/qemu-iotests/054   |2 +-
   tests/qemu-iotests/059   |2 +-
   tests/qemu-iotests/060   |2 +-
   tests/qemu-iotests/061   |2 +-
   tests/qemu-iotests/063   |2 +-
   tests/qemu-iotests/067.out   |8 +-
   tests/qemu-iotests/069   |2 +-
   tests/qemu-iotests/common|   22 ++-
   tests/qemu-iotests/common.rc |3 +
   34 files changed, 534 insertions(+), 33 deletions(-)
   create mode 100644 block/nfs.c
  Any update on the qemu-iotests changes discussed in this thread?  The
  actual NFS patch looks fine.
 I would like to leave the rework of patch 5 to the maintainers.

 Maybe you can merge the rest of the series except for this change in patch 2.

 diff --git a/tests/qemu-iotests/020 b/tests/qemu-iotests/020
 index b3c86d8..a42f32f 100755
 --- a/tests/qemu-iotests/020
 

Re: [Qemu-devel] [PATCHv5] block: add native support for NFS

2014-01-10 Thread ronnie sahlberg
On Fri, Jan 10, 2014 at 4:30 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 10/01/2014 13:12, Peter Lieven wrote:
 Then I shall convert everything to a qapi schema whereby the current
 design of libnfs is designed to work with plain URLs.

 No, no one is asking you to do this.  URLs are fine, but I agree with
 Kevin that parsing them in QEMU is better.

 Also because the QEMU parser is known to be based on RFCs and good code
 from libxml2.  For example, the iSCSI URL parser, when introduced,
 didn't even have percent-escape parsing, causing libvirt to fail with
 old libiscsi (and actually not that old too: IIRC libiscsi 1.7 will
 still fail).  Unless the libnfs parser is as good as libxml2's, I think
 there's value in using the QEMU URI parser.


I think that is fair enough.

The arguments we are talking about are the type of arguments that only
affect the interface between libnfs and the nfs server itself
and is not strictly all that interesting to the application that links
to libnfs.

Since parsing a URL does require a fair amount of code, a hundred
lines or more, it is a bit annoying having to re-implement the parsing
code for every single small utility. For example  nfs-ls   nfs-cp
nfs-cpor for the parsing, that is still done, in the sg-utils
patch.
For a lot of these small and semi-trivial applications we don't really
care all that much about what the options are but we might care a lot
about making it easier to use libnfs and to avoid having to write a
parser each time.

For those use cases, I definitely think that having a built in
function to parse a url, and automatically update the nfs context with
connection related tweaks is a good thing. It eliminates the need to
re-implement the parsing functions in every single trivial
application.


For QEMU and libvirt things may be different. These are non-trivial
applications and may need to be able to control the settings
explicitly in the QEMU code.
That is still possible to do. All the url arguments so far tweak
arguments that can also be controlled through explicit existing APIs.
So for QEMU, there are functions available in libnfs now that will
automatically update the nfs context with things like UID/GID to use
when talking to the server, passed via the URL and QEMU can use them.
On the other hand, if QEMU rather wants to parse the URL itself
and make calls into the libnfs API to tweak these settings directly
from the QEMU codebase, that is also possible.


For example:   nfs://1.2.3.4/path/file?uid=10&gid=10
When parsing these using the libnfs functions, the parsing functions
will automatically update the nfs context so that it will use these
values when it fills in the rpc header to send to the server.
But if you want to parse the url yourself, you can do that too, by
just calling   nfs_set_auth(nfs,  libnfs_authunix_create(..., 10, 10,
...
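
For completeness, the explicit-API variant could look roughly like the
sketch below; the exact libnfs_authunix_create() parameter list is an
assumption based on the classic authunix_create() it mirrors:

#include <nfsc/libnfs-zdr.h>
#include <nfsc/libnfs.h>
#include <nfsc/libnfs-raw.h>

/* sketch: create a context that talks to the server as uid/gid 10,
 * equivalent to nfs://1.2.3.4/path/file?uid=10&gid=10 */
static struct nfs_context *nfs_init_with_ids(void)
{
    struct nfs_context *nfs = nfs_init_context();
    if (nfs == NULL) {
        return NULL;
    }
    nfs_set_auth(nfs, libnfs_authunix_create("qemu", 10, 10, 0, NULL));
    return nfs;
}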


Regards
Ronnie Sahlberg



Re: [Qemu-devel] [PATCHv5] block: add native support for NFS

2014-01-10 Thread ronnie sahlberg
On Fri, Jan 10, 2014 at 8:10 AM, Peter Lieven p...@kamp.de wrote:

 Ronnie, can you also give a short advice on Kevin's question about short 
 reads.
 I think they can happen if we read past EOF or not?


Short reads should normally not happen in libnfs itself since servers
are often careful to always try to send back as much data as the
client requested.

There is a common exception though, for the case where you read past
the end of file.
So short reads should normally not happen. Unless QEMU or the guest
sends a request to libnfs to read past the end of the file.


If you send a READ for 1024 bytes to an nfs server at the offset 512
bytes from the end-of-file
then the server will respond with a read reply containing 512 bytes of
data  (and the eof flag set in the reply).

In my experience, most kernel/os based clients seem to be very careful
to never try to read beyond enf of file, so this rarely happens in
normal nfs.
(I only recall HPUX being a system where it would be common to always
issue nfs i/o in multiples of 4k   so for those clients it was very
important to make sure you implement short reads correctly in the
server).


I don't know how careful QEMU is in trying to prevent reading past the
end of the device or if it enforces it if the guest tries.
It is probably worth checking for short reads, at least for the case
where you might be reading past end of file.
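
A sketch of the zero-padding this implies on the QEMU side, once the
request has completed and task.status holds the byte count the server
returned (this is the "zero pad short reads" approach mentioned in the
v6 changelog; details illustrative):

if (task.status < 0) {
    return task.status;            /* real error from the server */
}
if (task.status < iov->size) {
    /* short read: we read past EOF, zero-fill the missing tail */
    qemu_iovec_memset(iov, task.status, 0, iov->size - task.status);
}
return 0;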



Re: [Qemu-devel] [Qemu-ppc] [PATCH] spapr_vscsi: Fix REPORT_LUNS handling

2014-01-02 Thread ronnie sahlberg
On Thu, Jan 2, 2014 at 7:41 AM, Alexander Graf ag...@suse.de wrote:

 On 02.01.2014, at 16:31, Alexander Graf ag...@suse.de wrote:


 On 18.10.2013, at 14:33, Nathan Whitehorn nwhiteh...@freebsd.org wrote:

 Intercept REPORT_LUNS commands addressed either to SRP LUN 0 or the 
 well-known
 LUN for REPORT_LUNS commands. This is required to implement the SAM and SPC
 specifications.

 Since SRP implements only a single SCSI target port per connection, the SRP
 target is required to report all available LUNs in response to a REPORT_LUNS
 command addressed either to LUN 0 or the well-known LUN. Instead, QEMU was
 forwarding such requests to the first QEMU SCSI target, with the result that
 initiators that relied on this feature would only see LUNs on the first QEMU
 SCSI target.

 Behavior for REPORT_LUNS commands addressed to any other LUN is not 
 specified
 by the standard and so is left unchanged. This preserves behavior under 
 Linux
 and SLOF, which enumerate possible LUNs by hand and so address no commands
 either to LUN 0 or the well-known REPORT_LUNS LUN.

 Signed-off-by: Nathan Whitehorn nwhiteh...@freebsd.org

 This patch fails on checkpatch.pl. Please fix those warnings up :).

 WARNING: braces {} are necessary for all arms of this statement
 #65: FILE: hw/scsi/spapr_vscsi.c:738:
 +if (dev->channel == 0 && dev->id == 0 && dev->lun == 0)
 [...]

 WARNING: braces {} are necessary for all arms of this statement
 #81: FILE: hw/scsi/spapr_vscsi.c:754:
 +if (dev->id == 0 && dev->channel == 0)
 [...]
 +else
 [...]

 WARNING: line over 80 characters
 #108: FILE: hw/scsi/spapr_vscsi.c:781:
 +if ((srp->cmd.lun == 0 || be64_to_cpu(srp->cmd.lun) == 
 SRP_REPORT_LUNS_WLUN) && srp->cmd.cdb[0] == REPORT_LUNS) {

 total: 0 errors, 3 warnings, 75 lines checked

 Your patch has style problems, please review.  If any of these errors
 are false positives report them to the maintainer, see
 CHECKPATCH in MAINTAINERS.

 ---
 hw/scsi/spapr_vscsi.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

 diff --git a/hw/scsi/spapr_vscsi.c b/hw/scsi/spapr_vscsi.c
 index 2a26042..87e0fb3 100644
 --- a/hw/scsi/spapr_vscsi.c
 +++ b/hw/scsi/spapr_vscsi.c
 @@ -63,6 +63,8 @@
 #define SCSI_SENSE_BUF_SIZE 96
 #define SRP_RSP_SENSE_DATA_LEN  18

 +#define SRP_REPORT_LUNS_WLUN 0xc101000
 +
 typedef union vscsi_crq {
struct viosrp_crq s;
uint8_t raw[16];
 @@ -720,12 +722,67 @@ static void vscsi_inquiry_no_target(VSCSIState *s, 
 vscsi_req *req)
}
 }

 +static void vscsi_report_luns(VSCSIState *s, vscsi_req *req)
 +{
 +BusChild *kid;
 +int i, len, n, rc;
 +uint8_t *resp_data;
 +bool found_lun0;
 +
 +n = 0;
 +found_lun0 = false;
 +QTAILQ_FOREACH(kid, &s->bus.qbus.children, sibling) {
 +SCSIDevice *dev = SCSI_DEVICE(kid->child);
 +
 +n += 8;
 +if (dev->channel == 0 && dev->id == 0 && dev->lun == 0)
 +found_lun0 = true;
 +}
 +if (!found_lun0) {
 +n += 8;
 +}
 +len = n+8;

 Let me try to grasp what you're doing here. You're trying to figure out how 
 many devices there are attached to the bus. For every device you reserve a 
 buffer block. Lun0 is mandatory, all others are optional.

 First off, I think the code would be easier to grasp if you'd count number 
 of entries rather than number of bytes. That way we don't have to 
 mentally deal with the 8 byte block granularity.

 Then IIUC you're jumping through a lot of hoops to count lun0 if it's there, 
 but keep it reserved when it's not there. Why don't you just always reserve 
 entry 0 for lun0? In the loop where you're actually filling in data you just 
 skip lun0. Or is lun0 a terminator and always has to come last?


 +
 +resp_data = malloc(len);

 g_malloc0

 +memset(resp_data, 0, len);
 +stl_be_p(resp_data, n);
 +i = found_lun0 ? 8 : 16;
 +QTAILQ_FOREACH(kid, &s->bus.qbus.children, sibling) {
 +DeviceState *qdev = kid->child;
 +SCSIDevice *dev = SCSI_DEVICE(qdev);
 +
 +if (dev->id == 0 && dev->channel == 0)
 +resp_data[i] = 0;
 +else
 +resp_data[i] = (2 << 6);

This looks odd.
Shouldn't this rather be
 resp_data[i] = (1 << 6);
to set the LUN address method to 01b  meaning Single Level LUN structure.
(SAM5 4.7.3)

He is setting the address method to 10b but there is no such address
method afaik.



 Ah, I almost forgot this one. Please convert that into something more verbose 
 through a define. Whatever that bit means ... :).

 +resp_data[i] |= dev->id;

He should do something like :
resp_data[i] |= dev->id & 0x3f;
here to avoid a dev->id > 63 from spilling into the address method field.

Or probably should have a check for
if dev->id > 63  then fail



 +resp_data[i+1] = (dev->channel << 5);
 +resp_data[i+1] |= dev->lun;

 What are the other 6 bytes reserved for?


 Alex





Re: [Qemu-devel] [PATCHv4] block: add native support for NFS

2013-12-25 Thread ronnie sahlberg
On Wed, Dec 25, 2013 at 9:42 PM, Fam Zheng f...@redhat.com wrote:
 On 2013-12-21 00:04, Peter Lieven wrote:

 This patch adds native support for accessing images on NFS shares without
 the requirement to actually mount the entire NFS share on the host.

 NFS Images can simply be specified by an url of the form:
 nfs://host/export/filename

 For example:
 qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2

 You need LibNFS from Ronnie Sahlberg available at:
 git://github.com/sahlberg/libnfs.git
 for this to work.

 During configure it is automatically probed for libnfs and support
 is enabled on-the-fly. You can forbid or enforce libnfs support
 with --disable-libnfs or --enable-libnfs respectively.

 Due to NFS restrictions you might need to execute your binaries
 as root, allow them to open privileged ports (< 1024) or specify
 the "insecure" option on the NFS server.


 What are the error messages like, if no privilege. Is root always required
 for this to work?

NFS servers often default to only allow client connections that
originate from a system port.
I know three different ways to solve this:

1, Run QEMU as root, which allows libnfs to bind to a system port.
This is probably suboptimal since I guess most people would want to
avoid running qemu as root if they can avoid it.

2, Change the NFS server to allow connections from nonsystem ports. On
linux NFS servers this is done by adding
"insecure" as the export option in /etc/exports.
This may be preferable to option 1 (since secure/insecure does not
really add much security in the first place).

3, Assign the capability to qemu to bind to system ports when running
as non-root user.
This is probably the most attractive option of the three.
You can still run qemu as non-root and you don't have to change the
security mode on the NFS server.
It is highly non-portable though and only work on platforms that
provide capabilities.
On linux you add this capability using :
sudo setcap 'cap_net_bind_service=+ep' /path/to/executable


regards
ronnie sahlberg



Re: [Qemu-devel] [PATCHv2] block: add native support for NFS

2013-12-20 Thread ronnie sahlberg
The sync calls use a trivial eventloop built into libnfs using poll().

Mixing the _async() and _sync() interfaces in libnfs means you may
risk running nested eventloops. Pain and tears lie behind that door.

On Fri, Dec 20, 2013 at 6:43 AM, Peter Lieven p...@kamp.de wrote:
 On 20.12.2013 15:38, Stefan Hajnoczi wrote:

 On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven p...@kamp.de wrote:

 On 20.12.2013 14:57, Stefan Hajnoczi wrote:

 On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven p...@kamp.de wrote:

 On 20.12.2013 13:19, Stefan Hajnoczi wrote:

 On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:

 On 17.12.2013 17:47, Stefan Hajnoczi wrote:

 On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:

 +/* set to -ENOTSUP since bdrv_allocated_file_size is only used
 + * in qemu-img open. So we can use the cached value for allocated
 + * filesize obtained from fstat at open time */
 +client->allocated_file_size = -ENOTSUP;

 Can you implement this fully?  By stubbing it out like this we won't
 be
 able to call get_allocated_file_size() at runtime in the future
 without
 updating the nfs block driver code.  It's just an fstat call,
 shouldn't
 be too hard to implement properly :).

 It seems I have to leave it as is currently.
 bdrv_get_allocated_file_size
 is not in a coroutine context. I get coroutine yields to no one.

 Create a coroutine and pump the event loop until it has reached
 completion:

 co = qemu_coroutine_create(my_coroutine_fn, ...);
 qemu_coroutine_enter(co, foo);
 while (!complete) {
qemu_aio_wait();
 }

 See block.c for similar examples.

 Wouldn't it make sense to make this modification to
 bdrv_get_allocated_file_size in
 block.c rather than in block/nfs.c and in the future potentially other
 drivers?

 If yes, I would ask you to take v3 of the NFS protocol patch and I
 promise
 to send
 a follow up early next year to make this modification to block.c and
 change
 block/nfs.c
 and other implementations to be a coroutine_fn.

 .bdrv_get_allocated_file_size() implementations in other block drivers
 are synchronous.  Making the block driver interface use coroutines
 would be wrong unless all the block drivers were updated to use
 coroutines too.

 I can do that. I think its not too complicated because all those
 implementations do not rely on callbacks. It should be possible
 to just rename the existing implementations to let's say
 .bdrv_co_get_allocated_file_size and call them inside a coroutine.

 No, that would be wrong because coroutine functions should not block.
 The point of coroutines is that if they cannot proceed they must yield
 so the event loop regains control.  If you simply rename the function
 to _co_ then they will block the event loop and not be true coroutine
 functions.

 Can you just call nfs_fstat() (the sync libnfs interface)?

 I can only do that if it's guaranteed that no other requests are in flight
 otherwise it will mess up.

 How will it mess up?

 The sync calls into libnfs are just wrappers around the async calls.
 The problem is that this wrapper will handle all the callbacks for the
 in-flight requests and they will never return.

 Peter





Re: [Qemu-devel] [PATCH] block/iscsi: return -ENOMEM if an async call fails immediately

2013-12-20 Thread ronnie sahlberg
Looks good.

Reviewed-by: Ronnie Sahlberg ronniesahlb...@gmail.com

On Fri, Dec 20, 2013 at 1:02 AM, Peter Lieven p...@kamp.de wrote:
 if an async libiscsi call fails directly it can only be due
 to an out of memory condition. All other errors are returned
 through the callback.

 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c |   12 ++++++------
  1 file changed, 6 insertions(+), 6 deletions(-)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index 56c0799..657a348 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -308,7 +308,7 @@ retry:
  iscsi_co_generic_cb, &iTask);
  if (iTask.task == NULL) {
  g_free(buf);
 -return -EIO;
 +return -ENOMEM;
  }
  #if defined(LIBISCSI_FEATURE_IOVECTOR)
  scsi_task_set_iov_out(iTask.task, (struct scsi_iovec *) iov->iov,
 @@ -376,7 +376,7 @@ retry:
  break;
  }
  if (iTask.task == NULL) {
 -return -EIO;
 +return -ENOMEM;
  }
  #if defined(LIBISCSI_FEATURE_IOVECTOR)
  scsi_task_set_iov_in(iTask.task, (struct scsi_iovec *) iov->iov, iov->niov);
 @@ -419,7 +419,7 @@ static int coroutine_fn iscsi_co_flush(BlockDriverState *bs)
  retry:
  if (iscsi_synchronizecache10_task(iscsilun->iscsi, iscsilun->lun, 0, 0, 0,
0, iscsi_co_generic_cb, &iTask) == NULL) {
 -return -EIO;
 +return -ENOMEM;
  }

  while (!iTask.complete) {
 @@ -669,7 +669,7 @@ retry:
sector_qemu2lun(sector_num, iscsilun),
8 + 16, iscsi_co_generic_cb,
   &iTask) == NULL) {
 -ret = -EIO;
 +ret = -ENOMEM;
  goto out;
  }

 @@ -753,7 +753,7 @@ coroutine_fn iscsi_co_discard(BlockDriverState *bs, int64_t sector_num,
  retry:
  if (iscsi_unmap_task(iscsilun->iscsi, iscsilun->lun, 0, 0, &list, 1,
   iscsi_co_generic_cb, &iTask) == NULL) {
 -return -EIO;
 +return -ENOMEM;
  }

  while (!iTask.complete) {
 @@ -822,7 +822,7 @@ retry:
 iscsilun->zeroblock, iscsilun->block_size,
 nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
 0, 0, iscsi_co_generic_cb, &iTask) == NULL) {
 -return -EIO;
 +return -ENOMEM;
  }

  while (!iTask.complete) {
 --
 1.7.9.5




Re: [Qemu-devel] [PATCHv2] block: add native support for NFS

2013-12-18 Thread ronnie sahlberg
On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman owass...@redhat.com wrote:
 On 12/18/2013 01:03 AM, Peter Lieven wrote:



 On 17.12.2013 at 18:32, Daniel P. Berrange
 berra...@redhat.com wrote:

 On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
 This patch adds native support for accessing images on NFS shares
 without
 the requirement to actually mount the entire NFS share on the host.

 NFS Images can simply be specified by an url of the form:
 nfs://host/export/filename

 For example:
 qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2


 Does it support other config tunables, eg specifying which
 NFS version to use 2/3/4 ? If so will they be available as
 URI parameters in the obvious manner ?


 currently only v3 is supported by libnfs. what other tunables would you
 like to see?


 For live migration we need the sync option (async ignores O_SYNC and
 O_DIRECT sadly),
 will it be supported? or will it be the default?


If you use the high-level API that provides posix like functions, such
as nfs_open() then libnfs does.
nfs_open()/nfs_open_async() takes a mode parameter and libnfs checks
the O_SYNC flag in modes.

By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
NFS/WRITE3+UNSTABLE that allows the server to just write to
cache/memory.

IF you specify O_SYNC in the mode argument to nfs_open/nfs_open_async
then libnfs will flag this handle as sync and any calls to
nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC

Calls to nfs_fsync is translated to NFS/COMMIT3



Of course, as for normal file I/O this is useful but not great since
you can only control the sync vs async per open filehandle.
Libnfs does also allow you to send raw rpc commands to the server and
using this API you can control the sync behaviour for individual
writes.
This means you could do something like
* emulate SCSI to the guest.
* if guest sends SCSI/WRITE* without any FUA bits set, then for that
I/O you send a NFS3/WRITE+UNSTABLE
* if guest sends SCSI/WRITE* with FUA bits set, then for that I/O you
send NFS3/WRITE+FILE_SYNC
and then the guest kernel can control for each individual write
whether it is sync or not.

But that is probably something that can wait until later and don't
need to be part of the initial patch?
If peter wants to do this in the future I can create a small writeup
on how to mixin the two different APIs using the same context.
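
A very rough sketch of that mixin; treat the rpc_nfs_write_async()
signature and the fua plumbing as assumptions to double-check, the
stable_how values are the NFSv3 ones:

static int write_with_fua(struct nfs_context *nfs, struct nfs_fh3 *fh3,
                          char *buf, uint64_t offset, uint64_t count,
                          int fua, rpc_cb cb, void *private_data)
{
    /* NFSv3 stable_how: UNSTABLE lets the server cache the write,
     * FILE_SYNC forces it to stable storage before replying */
    int stable_how = fua ? FILE_SYNC : UNSTABLE;

    return rpc_nfs_write_async(nfs_get_rpc_context(nfs), cb, fh3,
                               buf, offset, count, stable_how,
                               private_data);
}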



Re: [Qemu-devel] [PATCHv2] block: add native support for NFS

2013-12-18 Thread ronnie sahlberg
On Wed, Dec 18, 2013 at 8:59 AM, Peter Lieven p...@kamp.de wrote:

 On 18.12.2013 at 15:42, ronnie sahlberg ronniesahlb...@gmail.com wrote:

 On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman owass...@redhat.com wrote:
 On 12/18/2013 01:03 AM, Peter Lieven wrote:



  On 17.12.2013 at 18:32, Daniel P. Berrange
  berra...@redhat.com wrote:

 On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
 This patch adds native support for accessing images on NFS shares
 without
 the requirement to actually mount the entire NFS share on the host.

 NFS Images can simply be specified by an url of the form:
 nfs://host/export/filename

 For example:
 qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2


 Does it support other config tunables, eg specifying which
 NFS version to use 2/3/4 ? If so will they be available as
 URI parameters in the obvious manner ?


 currently only v3 is supported by libnfs. what other tunables would you
 like to see?


 For live migration we need the sync option (async ignores O_SYNC and
 O_DIRECT sadly),
 will it be supported? or will it be the default?


 If you use the high-level API that provides posix like functions, such
 as nfs_open() then libnfs does.
 nfs_open()/nfs_open_async() takes a mode parameter and libnfs checks
 the O_SYNC flag in modes.

 By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
 NFS/WRITE3+UNSTABLE that allows the server to just write to
 cache/memory.

 IF you specify O_SYNC in the mode argument to nfs_open/nfs_open_async
 then libnfs will flag this handle as sync and any calls to
 nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC

 Calls to nfs_fsync is translated to NFS/COMMIT3

 If this NFS/COMMIT3 would issue a sync on the server that would be all we
 actually need.

You have that guarantee in NFS/COMMIT3
NFS/COMMIT3 will not return until the server has flushed the specified
range to disk.

However, while the NFS protocol allows you to specify a range for the
COMMIT3 call so that you can do things like
WRITE3 Offset:foo Length:bar
COMMIT3 Offset:foo Length:bar
many/most nfs servers will ignore the offset/length arguments to the
COMMIT3 call and always unconditionally make an fsync() for the whole
file.

This can make the COMMIT3 call very expensive for large files.


NFSv3 also supports FILE_SYNC write mode, which libnfs triggers if you
specify O_SYNC to nfs_open*()
In this mode every single NFS/WRITE3 is sent with the FILE_SYNC mode
which means that the server will guarantee to write the data to stable
storage before responding back to the client.
In this mode there is no real need to do anything at all or even call
COMMIT3  since there is never any writeback data on the server that
needs to be destaged.


Since many servers treat COMMIT3 as "unconditionally walk all blocks
for the whole file and make sure they are destaged" it is not clear
how

WRITE3-normal Offset:foo Length:bar
COMMIT3 Offset:foo Length:bar

will compare to

WRITE3+O_SYNC Offset:foo Length:bar

I would not be surprised if the second mode would have higher
(potentially significantly) performance than the former.



 And in this case migration should be safe. Even if we open
 a file with cache = none qemu would issue such a commit after every write.

 This also allows for writeback caching where the filesystem flushes would
 go through right to the server.





 Of course, as for normal file I/O this is useful but not great since
 you can only control the sync vs async per open filehandle.
 Libnfs does also allow you to send raw rpc commands to the server and
 using this API you can control the sync behaviour for individual
 writes.
 This means you could do something like
 * emulate SCSI to the guest.
 * if guest sends SCSI/WRITE* without any FUA bits set, then for that
 I/O you send a NFS3/WRITE+UNSTABLE
 * if guest sends SCSI/WRITE* with FUA bits set, then for that I/O you
 send NFS3/WRITE+FILE_SYNC
 and then the guest kernel can control for each individual write
 whether it is sync or not.

 But that is probably something that can wait until later and don't
 need to be part of the initial patch?
 If peter wants to do this in the future I can create a small writeup
 on how to mixin the two different APIs using the same context.

 We can do that, but I would like to focus on the basic functionality
 first.

ACK


 Peter




Re: [Qemu-devel] [PATCHv2] block: add native support for NFS

2013-12-18 Thread ronnie sahlberg
On Wed, Dec 18, 2013 at 9:42 AM, Peter Lieven p...@kamp.de wrote:

 On 18.12.2013 at 18:33, ronnie sahlberg ronniesahlb...@gmail.com wrote:

 On Wed, Dec 18, 2013 at 8:59 AM, Peter Lieven p...@kamp.de wrote:

 On 18.12.2013 at 15:42, ronnie sahlberg ronniesahlb...@gmail.com wrote:

 On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman owass...@redhat.com 
 wrote:
 On 12/18/2013 01:03 AM, Peter Lieven wrote:



 On 17.12.2013 at 18:32, Daniel P. Berrange
 berra...@redhat.com wrote:

 On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
 This patch adds native support for accessing images on NFS shares
 without
 the requirement to actually mount the entire NFS share on the host.

 NFS Images can simply be specified by an url of the form:
 nfs://host/export/filename

 For example:
 qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2


 Does it support other config tunables, eg specifying which
 NFS version to use 2/3/4 ? If so will they be available as
 URI parameters in the obvious manner ?


 currently only v3 is supported by libnfs. what other tunables would you
 like to see?


 For live migration we need the sync option (async ignores O_SYNC and
 O_DIRECT sadly),
 will it be supported? or will it be the default?


 If you use the high-level API that provides posix like functions, such
 as nfs_open() then libnfs does.
 nfs_open()/nfs_open_async() takes a mode parameter and libnfs checks
 the O_SYNC flag in modes.

 By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
 NFS/WRITE3+UNSTABLE that allows the server to just write to
 cache/memory.

 IF you specify O_SYNC in the mode argument to nfs_open/nfs_open_async
 then libnfs will flag this handle as sync and any calls to
 nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC

 Calls to nfs_fsync is translated to NFS/COMMIT3

 If this NFS/COMMIT3 would issue a sync on the server that would be all we
 actually need.

 You have that guarantee in NFS/COMMIT3
 NFS/COMMIT3 will not return until the server has flushed the specified
 range to disk.

 However, while the NFS protocol allows you to specify a range for the
 COMMIT3 call so that you can do things like
 WRITE3 Offset:foo Length:bar
 COMMIT3 Offset:foo Length:bar
 many/most nfs servers will ignore the offset/length arguments to the
 COMMIT3 call and always unconditionally make an fsync() for the whole
 file.

 This can make the COMMIT3 call very expensive for large files.


 NFSv3 also supports FILE_SYNC write mode, which libnfs triggers if you
 specify O_SYNC to nfs_open*()
 In this mode every single NFS/WRITE3 is sent with the FILE_SYNC mode
 which means that the server will guarantee to write the data to stable
 storage before responding back to the client.
 In this mode there is no real need to do anything at all or even call
 COMMIT3  since there is never any writeback data on the server that
 needs to be destaged.


 Since many servers treat COMMIT3 as "unconditionally walk all blocks
 for the whole file and make sure they are destaged" it is not clear
 how

 WRITE3-normal Offset:foo Length:bar
 COMMIT3 Offset:foo Length:bar

 will compare to

 WRITE3+O_SYNC Offset:foo Length:bar

 I would not be surprised if the second mode would have higher
 (potentially significantly) performance than the former.

 The qemu block layer currently is designed to send a bdrv_flush after every 
 single
 write if the write cache is not enabled. This means that the unwritten data 
 is just
 the data of the single write operation.

I understand that, there is only a single WRITE3 worth of data to
actually destage each time.

But what I meant is that for a lot of servers, for large files, the
server might need to spend a non-trivial amount of time
crunching file metadata and checking every single page of the file in
order to discover "I only need to destage pages x,y,z".

On many nfs servers this "figure out which blocks to flush" can take a
lot of time and affect performance greatly.



 However, changing this to issue a sync
 write call would require changing the whole API. The major problem is that
 the write cache setting can be changed while the device is open, otherwise
 we could just ignore all calls to bdrv_flush if the device was opened without
 write cache enabled.

 In the very popular case of using Virtio as Driver it is the case that the 
 device
 is always opened with disabled write cache and the write cache is only
 enabled after the host has negotiated with the guest that the guest is
 able to send flushes.

 We can keep in mind for a later version of the driver that we manually craft
 a write call with O_SYNC if the write cache is disabled and ignore bdrv_flush.
 And we use async write + commit via bdrv_flush in the case of an enabled
 write cache.

 Peter




Re: [Qemu-devel] [PATCHv2] block: add native support for NFS

2013-12-17 Thread ronnie sahlberg
NFSTask

Task is a very scsi-ish term. Maybe RPC is better ?

NFSrpc ?



On Tue, Dec 17, 2013 at 1:15 AM, Peter Lieven p...@kamp.de wrote:
 This patch adds native support for accessing images on NFS shares without
 the requirement to actually mount the entire NFS share on the host.

 NFS Images can simply be specified by an url of the form:
 nfs://host/export/filename

 For example:
 qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2

 You need libnfs from Ronnie Sahlberg available at:
git://github.com/sahlberg/libnfs.git
 for this to work.

 During configure it is automatically probed for libnfs and support
 is enabled on-the-fly. You can forbid or enforce libnfs support
 with --disable-libnfs or --enable-libnfs respectively.

 Due to NFS restrictions you might need to execute your binaries
 as root, allow them to open privileged ports (< 1024) or specify
 the "insecure" option on the NFS server.

 Signed-off-by: Peter Lieven p...@kamp.de
 ---
 v1->v2:
  - fixed block/Makefile.objs [Ronnie]
  - do not always register a read handler [Ronnie]
  - add support for reading beyond EOF [Fam]
  - fixed struct and paramter naming [Fam]
  - fixed overlong lines and whitespace errors [Fam]
  - return the return status from libnfs wherever possible [Fam]
  - added comment why we set allocated_file_size to -ENOTSUP after write [Fam]
  - avoid segfault when parsing filname [Fam]
  - remove unused close_bh from NFSClient [Fam]
  - avoid dividing and multiplying total_size by BDRV_SECTOR_SIZE in 
 nfs_file_create [Fam]

  MAINTAINERS |5 +
  block/Makefile.objs |1 +
  block/nfs.c |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  configure   |   38 +
  4 files changed, 463 insertions(+)
  create mode 100644 block/nfs.c

 diff --git a/MAINTAINERS b/MAINTAINERS
 index c19133f..f53d184 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -899,6 +899,11 @@ M: Peter Lieven p...@kamp.de
  S: Supported
  F: block/iscsi.c

 +NFS
 +M: Peter Lieven p...@kamp.de
 +S: Maintained
 +F: block/nfs.c
 +
  SSH
  M: Richard W.M. Jones rjo...@redhat.com
  S: Supported
 diff --git a/block/Makefile.objs b/block/Makefile.objs
 index f43ecbc..aa8eaf9 100644
 --- a/block/Makefile.objs
 +++ b/block/Makefile.objs
 @@ -12,6 +12,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
  ifeq ($(CONFIG_POSIX),y)
  block-obj-y += nbd.o sheepdog.o
  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
 +block-obj-$(CONFIG_LIBNFS) += nfs.o
  block-obj-$(CONFIG_CURL) += curl.o
  block-obj-$(CONFIG_RBD) += rbd.o
  block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 diff --git a/block/nfs.c b/block/nfs.c
 new file mode 100644
 index 000..006b8cc
 --- /dev/null
 +++ b/block/nfs.c
 @@ -0,0 +1,419 @@
 +/*
 + * QEMU Block driver for native access to files on NFS shares
 + *
 + * Copyright (c) 2013 Peter Lieven p...@kamp.de
 + *
 + * Permission is hereby granted, free of charge, to any person obtaining a 
 copy
 + * of this software and associated documentation files (the "Software"), to 
 deal
 + * in the Software without restriction, including without limitation the 
 rights
 + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 + * copies of the Software, and to permit persons to whom the Software is
 + * furnished to do so, subject to the following conditions:
 + *
 + * The above copyright notice and this permission notice shall be included in
 + * all copies or substantial portions of the Software.
 + *
 + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
 FROM,
 + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 + * THE SOFTWARE.
 + */
 +
 +#include "config-host.h"
 +
 +#include <poll.h>
 +#include "qemu-common.h"
 +#include "qemu/config-file.h"
 +#include "qemu/error-report.h"
 +#include "block/block_int.h"
 +#include "trace.h"
 +#include "qemu/iov.h"
 +#include "sysemu/sysemu.h"
 +
 +#include <nfsc/libnfs-zdr.h>
 +#include <nfsc/libnfs.h>
 +#include <nfsc/libnfs-raw.h>
 +#include <nfsc/libnfs-raw-mount.h>
 +
 +typedef struct nfsclient {
 +struct nfs_context *context;
 +struct nfsfh *fh;
 +int events;
 +bool has_zero_init;
 +int64_t allocated_file_size;
 +} NFSClient;
 +
 +typedef struct NFSTask {
 +int status;
 +int complete;
 +QEMUIOVector *iov;
 +Coroutine *co;
 +QEMUBH *bh;
 +} NFSTask;
 +
 +static void nfs_process_read(void *arg);
 +static void nfs_process_write(void *arg);
 +
 +static void nfs_set_events(NFSClient *client)
 +{
 +int ev = nfs_which_events(client->context);
 +if (ev != client->events) {
 +qemu_aio_set_fd_handler(nfs_get_fd(client->context),
 +  (ev & POLLIN) ? nfs_process_read

Re: [Qemu-devel] [PATCHv2] block: add native support for NFS

2013-12-17 Thread ronnie sahlberg
On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven p...@kamp.de wrote:
 On 17.12.2013 17:47, Stefan Hajnoczi wrote:
...
 Which NFS protocol versions are supported by current libnfs?

 Will check that out. Ronnie?


It uses NFS v3 only.



Re: [Qemu-devel] [PATCHv2] block: add native support for NFS

2013-12-17 Thread ronnie sahlberg
On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven p...@kamp.de wrote:
 On 17.12.2013 17:47, Stefan Hajnoczi wrote:

 On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:

...
 +if (nfs_pwrite_async(client->context, client->fh,
 + sector_num * BDRV_SECTOR_SIZE,
 + nb_sectors * BDRV_SECTOR_SIZE,
 + buf, nfs_co_generic_cb, &task) != 0) {
 +g_free(buf);
 +return -EIO;

 Can we get a more detailed errno here?  (e.g. ENOSPC)

 libnfs only returns 0 or -1 if the setup of the call
 fails. the status code from the RPC is more detailed
 and available in task.status.


The *_async() functions only allocate memory and marshall the
arguments to the buffer.
So barring marshalling bugs, they will only fail due to OOM.

So -ENOMEM is perhaps a better error for when *_async() returns an error.
That is really the only condition where these functions will fail.


If *_async() returns success you are guaranteed that
nfs_co_generic_cb() will be invoked
and there you can inspect the status argument for more detailed reason why.
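
So the caller ends up with a two-stage check, sketched here along the
lines of the patch:

if (nfs_pwrite_async(client->context, client->fh,
                     sector_num * BDRV_SECTOR_SIZE,
                     nb_sectors * BDRV_SECTOR_SIZE,
                     buf, nfs_co_generic_cb, &task) != 0) {
    g_free(buf);
    return -ENOMEM;         /* setup can only fail on OOM */
}
while (!task.complete) {    /* the real result arrives in the callback */
    nfs_set_events(client);
    qemu_coroutine_yield();
}
if (task.status < 0) {
    return task.status;     /* detailed NFS error, e.g. -ENOSPC */
}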



Re: [Qemu-devel] [PATCHv2] block: add native support for NFS

2013-12-17 Thread ronnie sahlberg
On Tue, Dec 17, 2013 at 2:36 PM, Peter Lieven p...@kamp.de wrote:


 On 17.12.2013 at 18:13, ronnie sahlberg ronniesahlb...@gmail.com wrote:

 On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven p...@kamp.de wrote:
 On 17.12.2013 17:47, Stefan Hajnoczi wrote:
 ...
 Which NFS protocol versions are supported by current libnfs?

 Will check that out. Ronnie?

 It uses NFS v3 only.

 should we use nfs3:// for the urls then?

No, I think we should leave it as nfs://... so that we are compatible
with rfc2224

Once/if/when I add support for v2 and v4 we can force a protocol
version using ?version=2

Then
nfs://server/foo/bar   would use whatever version the server offers
but
nfs://server/foo/bar?version=2 would become use version 2 only



Re: [Qemu-devel] [PATCH] block: add native support for NFS

2013-12-16 Thread ronnie sahlberg
Nice.

+block-obj-$(CONFIG_LIBISCSI) += nfs.o

Should be CONFIG_LIBNFS

On Mon, Dec 16, 2013 at 7:34 AM, Peter Lieven p...@kamp.de wrote:
 This patch adds native support for accessing images on NFS shares without
 the requirement to actually mount the entire NFS share on the host.

 NFS Images can simply be specified by an url of the form:
 nfs://host/export/filename

 For example:
 qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2

 You need libnfs from Ronnie Sahlberg available at:
git://github.com/sahlberg/libnfs.git
 for this to work.

 During configure it is automatically probed for libnfs and support
 is enabled on-the-fly. You can forbid or enforce libnfs support
 with --disable-libnfs or --enable-libnfs respectively.

 Due to NFS restrictions you might need to execute your binaries
 as root, allow them to open privileged ports (< 1024) or specify
 the "insecure" option on the NFS server.

 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  MAINTAINERS |5 +
  block/Makefile.objs |1 +
  block/nfs.c |  420 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  configure   |   38 +
  4 files changed, 464 insertions(+)
  create mode 100644 block/nfs.c

 diff --git a/MAINTAINERS b/MAINTAINERS
 index c19133f..f53d184 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -899,6 +899,11 @@ M: Peter Lieven p...@kamp.de
  S: Supported
  F: block/iscsi.c

 +NFS
 +M: Peter Lieven p...@kamp.de
 +S: Maintained
 +F: block/nfs.c
 +
  SSH
  M: Richard W.M. Jones rjo...@redhat.com
  S: Supported
 diff --git a/block/Makefile.objs b/block/Makefile.objs
 index f43ecbc..1bac94e 100644
 --- a/block/Makefile.objs
 +++ b/block/Makefile.objs
 @@ -12,6 +12,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
  ifeq ($(CONFIG_POSIX),y)
  block-obj-y += nbd.o sheepdog.o
  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
 +block-obj-$(CONFIG_LIBISCSI) += nfs.o
  block-obj-$(CONFIG_CURL) += curl.o
  block-obj-$(CONFIG_RBD) += rbd.o
  block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 diff --git a/block/nfs.c b/block/nfs.c
 new file mode 100644
 index 000..d6cb4c0
 --- /dev/null
 +++ b/block/nfs.c
 @@ -0,0 +1,420 @@
 +/*
 + * QEMU Block driver for native access to files on NFS shares
 + *
 + * Copyright (c) 2013 Peter Lieven p...@kamp.de
 + *
 + * Permission is hereby granted, free of charge, to any person obtaining a 
 copy
 + * of this software and associated documentation files (the "Software"), to 
 deal
 + * in the Software without restriction, including without limitation the 
 rights
 + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 + * copies of the Software, and to permit persons to whom the Software is
 + * furnished to do so, subject to the following conditions:
 + *
 + * The above copyright notice and this permission notice shall be included in
 + * all copies or substantial portions of the Software.
 + *
 + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
 FROM,
 + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 + * THE SOFTWARE.
 + */
 +
 +#include "config-host.h"
 +
 +#include <poll.h>
 +#include <arpa/inet.h>
 +#include "qemu-common.h"
 +#include "qemu/config-file.h"
 +#include "qemu/error-report.h"
 +#include "block/block_int.h"
 +#include "trace.h"
 +#include "block/scsi.h"
 +#include "qemu/iov.h"
 +#include "sysemu/sysemu.h"
 +#include "qmp-commands.h"
 +
 +#include <nfsc/libnfs-zdr.h>
 +#include <nfsc/libnfs.h>
 +#include <nfsc/libnfs-raw.h>
 +#include <nfsc/libnfs-raw-mount.h>
 +
 +typedef struct nfsclient {
 +struct nfs_context *context;
 +struct nfsfh *fh;
 +int events;
 +bool has_zero_init;
 +int64_t allocated_file_size;
 +QEMUBH *close_bh;
 +} nfsclient;
 +
 +typedef struct NFSTask {
 +int status;
 +int complete;
 +QEMUIOVector *iov;
 +Coroutine *co;
 +QEMUBH *bh;
 +} NFSTask;
 +
 +static void nfs_process_read(void *arg);
 +static void nfs_process_write(void *arg);
 +
 +static void nfs_set_events(nfsclient *client)
 +{
 +int ev;
 +/* We always register a read handler.  */
 +ev = POLLIN;
 +ev |= nfs_which_events(client->context);
 +if (ev != client->events) {
 +qemu_aio_set_fd_handler(nfs_get_fd(client->context),
 +  nfs_process_read,
 +  (ev & POLLOUT) ? nfs_process_write : NULL,
 +  client);
 +
 +}
 +client->events = ev;
 +}
 +
 +static void nfs_process_read(void *arg)
 +{
 +nfsclient *client = arg;
 +nfs_service(client->context, POLLIN);
 +nfs_set_events(client);
 +}
 +
 +static void nfs_process_write(void *arg)
 +{
 +nfsclient *client = arg;
 +nfs_service(client->context

Re: [Qemu-devel] [PATCH] block: add native support for NFS

2013-12-16 Thread ronnie sahlberg
On Mon, Dec 16, 2013 at 7:34 AM, Peter Lieven p...@kamp.de wrote:
 This patch adds native support for accessing images on NFS shares without
 the requirement to actually mount the entire NFS share on the host.

 NFS Images can simply be specified by an url of the form:
 nfs://host/export/filename

 For example:
 qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2

 You need libnfs from Ronnie Sahlberg available at:
git://github.com/sahlberg/libnfs.git
 for this to work.

 During configure it is automatically probed for libnfs and support
 is enabled on-the-fly. You can forbid or enforce libnfs support
 with --disable-libnfs or --enable-libnfs respectively.

 Due to NFS restrictions you might need to execute your binaries
 as root, allow them to open privileged ports (< 1024) or specify
 the "insecure" option on the NFS server.

 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  MAINTAINERS |5 +
  block/Makefile.objs |1 +
  block/nfs.c |  420 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  configure   |   38 +
  4 files changed, 464 insertions(+)
  create mode 100644 block/nfs.c

 diff --git a/MAINTAINERS b/MAINTAINERS
 index c19133f..f53d184 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -899,6 +899,11 @@ M: Peter Lieven p...@kamp.de
  S: Supported
  F: block/iscsi.c

 +NFS
 +M: Peter Lieven p...@kamp.de
 +S: Maintained
 +F: block/nfs.c
 +
  SSH
  M: Richard W.M. Jones rjo...@redhat.com
  S: Supported
 diff --git a/block/Makefile.objs b/block/Makefile.objs
 index f43ecbc..1bac94e 100644
 --- a/block/Makefile.objs
 +++ b/block/Makefile.objs
 @@ -12,6 +12,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
  ifeq ($(CONFIG_POSIX),y)
  block-obj-y += nbd.o sheepdog.o
  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
 +block-obj-$(CONFIG_LIBISCSI) += nfs.o

CONFIG_LIBNFS

  block-obj-$(CONFIG_CURL) += curl.o
  block-obj-$(CONFIG_RBD) += rbd.o
  block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 diff --git a/block/nfs.c b/block/nfs.c
 new file mode 100644
 index 000..d6cb4c0
 --- /dev/null
 +++ b/block/nfs.c
 @@ -0,0 +1,420 @@
 +/*
 + * QEMU Block driver for native access to files on NFS shares
 + *
 + * Copyright (c) 2013 Peter Lieven p...@kamp.de
 + *
 + * Permission is hereby granted, free of charge, to any person obtaining a 
 copy
 + * of this software and associated documentation files (the "Software"), to 
 deal
 + * in the Software without restriction, including without limitation the 
 rights
 + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 + * copies of the Software, and to permit persons to whom the Software is
 + * furnished to do so, subject to the following conditions:
 + *
 + * The above copyright notice and this permission notice shall be included in
 + * all copies or substantial portions of the Software.
 + *
 + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
 FROM,
 + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 + * THE SOFTWARE.
 + */
 +
 +#include "config-host.h"
 +
 +#include <poll.h>
 +#include <arpa/inet.h>
 +#include "qemu-common.h"
 +#include "qemu/config-file.h"
 +#include "qemu/error-report.h"
 +#include "block/block_int.h"
 +#include "trace.h"
 +#include "block/scsi.h"
 +#include "qemu/iov.h"
 +#include "sysemu/sysemu.h"
 +#include "qmp-commands.h"
 +
 +#include <nfsc/libnfs-zdr.h>
 +#include <nfsc/libnfs.h>
 +#include <nfsc/libnfs-raw.h>
 +#include <nfsc/libnfs-raw-mount.h>
 +
 +typedef struct nfsclient {
 +struct nfs_context *context;
 +struct nfsfh *fh;
 +int events;
 +bool has_zero_init;
 +int64_t allocated_file_size;
 +QEMUBH *close_bh;
 +} nfsclient;
 +
 +typedef struct NFSTask {
 +int status;
 +int complete;
 +QEMUIOVector *iov;
 +Coroutine *co;
 +QEMUBH *bh;
 +} NFSTask;
 +
 +static void nfs_process_read(void *arg);
 +static void nfs_process_write(void *arg);
 +
 +static void nfs_set_events(nfsclient *client)
 +{
 +int ev;
 +/* We always register a read handler.  */
 +ev = POLLIN;

I don't think you should always register a read handler here, since
there are no situations where the NFS server will send unsolicited
packets back to the client (contrary to iSCSI, where the target might
send NOPs back to ping the client).

When NFS sessions are idle, it is common for servers to tear down the
TCP connection to save resources and to rely on clients transparently
reconnecting the TCP session once there is new I/O to send.
This reconnect is handled automatically/transparently within libnfs.

 +ev |= nfs_which_events(client->context);
 +if (ev != client->events) {
 +qemu_aio_set_fd_handler
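
For reference, a minimal sketch of the suggested alternative -- let
libnfs itself decide which events to poll for instead of forcing
POLLIN. This is an illustration only; the qemu_aio_set_fd_handler
signature varies across QEMU versions:

    static void nfs_set_events(nfsclient *client)
    {
        /* No unsolicited traffic from an NFS server: poll only for
         * what libnfs itself asks for. */
        int ev = nfs_which_events(client->context);

        if (ev != client->events) {
            qemu_aio_set_fd_handler(nfs_get_fd(client->context),
                                    (ev & POLLIN) ? nfs_process_read : NULL,
                                    (ev & POLLOUT) ? nfs_process_write : NULL,
                                    client);
            client->events = ev;
        }
    }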

Re: [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME

2013-11-19 Thread ronnie sahlberg
+#define SCSI_WRITE_SAME_MAX 524288
...
+data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);

I don't think you should just clamp the data to 512k; instead, I think
you should report the 512k maximum write same size through
BlockLimitsVPD/MaximumWriteSameLength to the initiator.
Then, instead of clamping the write with MIN(), you return a check
condition if the initiator sends too much.
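
For illustration, reporting the limit in the Block Limits VPD page
could look roughly like this. A sketch only: the MAXIMUM WRITE SAME
LENGTH field is assumed to be an 8-byte block count at bytes 36..43 of
page B0h per SBC-3 -- verify the offsets against the spec:

    /* advertise the WRITE SAME limit instead of clamping silently */
    uint64_t max_ws_blocks = SCSI_WRITE_SAME_MAX / 512;
    int k;

    for (k = 0; k < 8; k++) {
        outbuf[36 + k] = (max_ws_blocks >> (56 - k * 8)) & 0xff;
    }
    /* ...and fail oversized requests with INVALID FIELD IN CDB
     * instead of truncating them. */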

On Tue, Nov 19, 2013 at 9:07 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Fetch the data to be written from the input buffer.  If it is all zeroes,
 we can use the write_zeroes call (possibly with the new MAY_UNMAP flag).
 Otherwise, do as many write cycles as needed, writing 512k at a time.

 Strictly speaking, this is still incorrect because a zero cluster should
 only be written if the MAY_UNMAP flag is set.  But this is a bug in qcow2
 and the other formats, not in the SCSI code.

 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
 hw/scsi/scsi-disk.c | 140 +++-
  1 file changed, 116 insertions(+), 24 deletions(-)

 diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
 index 0640bb0..4cc6a28 100644
 --- a/hw/scsi/scsi-disk.c
 +++ b/hw/scsi/scsi-disk.c
 @@ -41,6 +41,7 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
  #include <scsi/sg.h>
  #endif

 +#define SCSI_WRITE_SAME_MAX 524288
  #define SCSI_DMA_BUF_SIZE   131072
  #define SCSI_MAX_INQUIRY_LEN256
  #define SCSI_MAX_MODE_LEN   256
 @@ -634,6 +635,8 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)
          buflen = 0x40;
          memset(outbuf + 4, 0, buflen - 4);
 
 +        outbuf[4] = 0x1; /* wsnz */
 +
          /* optimal transfer length granularity */
          outbuf[6] = (min_io_size >> 8) & 0xff;
          outbuf[7] = min_io_size & 0xff;
 @@ -1589,6 +1592,111 @@ invalid_field:
  scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
  }

 +typedef struct WriteSameCBData {
 +SCSIDiskReq *r;
 +int64_t sector;
 +int nb_sectors;
 +QEMUIOVector qiov;
 +struct iovec iov;
 +} WriteSameCBData;
 +
 +static void scsi_write_same_complete(void *opaque, int ret)
 +{
 +    WriteSameCBData *data = opaque;
 +    SCSIDiskReq *r = data->r;
 +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 +
 +    assert(r->req.aiocb != NULL);
 +    r->req.aiocb = NULL;
 +    bdrv_acct_done(s->qdev.conf.bs, &r->acct);
 +    if (r->req.io_canceled) {
 +        goto done;
 +    }
 +
 +    if (ret < 0) {
 +        if (scsi_handle_rw_error(r, -ret)) {
 +            goto done;
 +        }
 +    }
 +
 +    data->nb_sectors -= data->iov.iov_len / 512;
 +    data->sector += data->iov.iov_len / 512;
 +    data->iov.iov_len = MIN(data->nb_sectors * 512, data->iov.iov_len);
 +    if (data->iov.iov_len) {
 +        bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len,
 +                        BDRV_ACCT_WRITE);
 +        r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
 +                                       &data->qiov, data->iov.iov_len / 512,
 +                                       scsi_write_same_complete, r);
 +        return;
 +    }
 +
 +    scsi_req_complete(&r->req, GOOD);
 +
 +done:
 +    if (!r->req.io_canceled) {
 +        scsi_req_unref(&r->req);
 +    }
 +    g_free(data->iov.iov_base);
 +    g_free(data);
 +}
 +
 +static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
 +{
 +    SCSIRequest *req = &r->req;
 +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
 +    uint32_t nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
 +    WriteSameCBData *data;
 +    uint8_t *buf;
 +    int i;
 +
 +    /* Fail if PBDATA=1 or LBDATA=1 or ANCHOR=1.  */
 +    if (nb_sectors == 0 || (req->cmd.buf[1] & 0x16)) {
 +        scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 +        return;
 +    }
 +
 +    if (bdrv_is_read_only(s->qdev.conf.bs)) {
 +        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
 +        return;
 +    }
 +    if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
 +        scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 +        return;
 +    }
 +
 +    if (buffer_is_zero(inbuf, s->qdev.blocksize)) {
 +        int flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
 +
 +        /* The request is used as the AIO opaque value, so add a ref.  */
 +        scsi_req_ref(&r->req);
 +        bdrv_acct_start(s->qdev.conf.bs, &r->acct,
 +                        nb_sectors * s->qdev.blocksize, BDRV_ACCT_WRITE);
 +        r->req.aiocb = bdrv_aio_write_zeroes(s->qdev.conf.bs,
 +                                             r->req.cmd.lba * (s->qdev.blocksize / 512),
 +                                             nb_sectors * (s->qdev.blocksize / 512),
 +                                             flags, scsi_aio_complete, r);
 +        return;
 +    }
 +
 +    data = g_new0(WriteSameCBData, 1);
 +    data->r = r;
 +    data->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
 +   

Re: [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME

2013-11-19 Thread ronnie sahlberg
That means the initiator will do the split into smaller manageable
chunks for you and you get a 1-to-1 mapping between WS10/16 that the
initiator issues to qemu and the write-same calls that qemu generates.

On Tue, Nov 19, 2013 at 9:23 AM, ronnie sahlberg
ronniesahlb...@gmail.com wrote:
 +#define SCSI_WRITE_SAME_MAX 524288
 ...
 +data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);

 I don't think you should just clamp the data to 512k; instead, I think
 you should report the 512k maximum write same size through
 BlockLimitsVPD/MaximumWriteSameLength to the initiator.
 Then, instead of clamping the write with MIN(), you return a check
 condition if the initiator sends too much.

 On Tue, Nov 19, 2013 at 9:07 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Fetch the data to be written from the input buffer.  If it is all zeroes,
 we can use the write_zeroes call (possibly with the new MAY_UNMAP flag).
 Otherwise, do as many write cycles as needed, writing 512k at a time.

 Strictly speaking, this is still incorrect because a zero cluster should
 only be written if the MAY_UNMAP flag is set.  But this is a bug in qcow2
 and the other formats, not in the SCSI code.

 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  hw/scsi/scsi-disk.c | 140 +++-
  1 file changed, 116 insertions(+), 24 deletions(-)

 diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
 index 0640bb0..4cc6a28 100644
 --- a/hw/scsi/scsi-disk.c
 +++ b/hw/scsi/scsi-disk.c
 @@ -41,6 +41,7 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
   #include <scsi/sg.h>
  #endif

 +#define SCSI_WRITE_SAME_MAX 524288
  #define SCSI_DMA_BUF_SIZE   131072
  #define SCSI_MAX_INQUIRY_LEN256
  #define SCSI_MAX_MODE_LEN   256
 @@ -634,6 +635,8 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)
           buflen = 0x40;
           memset(outbuf + 4, 0, buflen - 4);
 
  +        outbuf[4] = 0x1; /* wsnz */
  +
           /* optimal transfer length granularity */
           outbuf[6] = (min_io_size >> 8) & 0xff;
           outbuf[7] = min_io_size & 0xff;
 @@ -1589,6 +1592,111 @@ invalid_field:
  scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
  }

 +typedef struct WriteSameCBData {
 +SCSIDiskReq *r;
 +int64_t sector;
 +int nb_sectors;
 +QEMUIOVector qiov;
 +struct iovec iov;
 +} WriteSameCBData;
 +
 +static void scsi_write_same_complete(void *opaque, int ret)
 +{
 +    WriteSameCBData *data = opaque;
 +    SCSIDiskReq *r = data->r;
 +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 +
 +    assert(r->req.aiocb != NULL);
 +    r->req.aiocb = NULL;
 +    bdrv_acct_done(s->qdev.conf.bs, &r->acct);
 +    if (r->req.io_canceled) {
 +        goto done;
 +    }
 +
 +    if (ret < 0) {
 +        if (scsi_handle_rw_error(r, -ret)) {
 +            goto done;
 +        }
 +    }
 +
 +    data->nb_sectors -= data->iov.iov_len / 512;
 +    data->sector += data->iov.iov_len / 512;
 +    data->iov.iov_len = MIN(data->nb_sectors * 512, data->iov.iov_len);
 +    if (data->iov.iov_len) {
 +        bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len,
 +                        BDRV_ACCT_WRITE);
 +        r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
 +                                       &data->qiov, data->iov.iov_len / 512,
 +                                       scsi_write_same_complete, r);
 +        return;
 +    }
 +
 +    scsi_req_complete(&r->req, GOOD);
 +
 +done:
 +    if (!r->req.io_canceled) {
 +        scsi_req_unref(&r->req);
 +    }
 +    g_free(data->iov.iov_base);
 +    g_free(data);
 +}
 +
 +static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
 +{
 +    SCSIRequest *req = &r->req;
 +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
 +    uint32_t nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
 +    WriteSameCBData *data;
 +    uint8_t *buf;
 +    int i;
 +
 +    /* Fail if PBDATA=1 or LBDATA=1 or ANCHOR=1.  */
 +    if (nb_sectors == 0 || (req->cmd.buf[1] & 0x16)) {
 +        scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 +        return;
 +    }
 +
 +    if (bdrv_is_read_only(s->qdev.conf.bs)) {
 +        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
 +        return;
 +    }
 +    if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
 +        scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 +        return;
 +    }
 +
 +    if (buffer_is_zero(inbuf, s->qdev.blocksize)) {
 +        int flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
 +
 +        /* The request is used as the AIO opaque value, so add a ref.  */
 +        scsi_req_ref(&r->req);
 +        bdrv_acct_start(s->qdev.conf.bs, &r->acct,
 +                        nb_sectors * s->qdev.blocksize, BDRV_ACCT_WRITE);
 +        r->req.aiocb = bdrv_aio_write_zeroes(s->qdev.conf.bs,
 +                                             r->req.cmd.lba * (s->qdev.blocksize / 512

Re: [Qemu-devel] [PATCHv7 00/17] block: logical block provisioning enhancements

2013-11-08 Thread ronnie sahlberg
For better support for older versions of libiscsi,
I think it would be good to convert all the
iscsi_unmap_task/iscsi_writesame*_task/iscsi_*_task functions to
calls to the much more generic iscsi_scsi_command_sync().

iscsi_scsi_command_sync() and iscsi_scsi_command_async() have been
available since prior to version 1.1 and can be used to send arbitrary
scsi opcodes to the target.


iscsi_scsi_command_async() is already used instead of
iscsi_read16/write16_async(), since the read16/write16 helpers were
added to libiscsi at a much later stage, and there are examples of its
use.


Using iscsi_scsi_command_[a]sync() instead of for example
iscsi_unmap_task() would mean that you can use the UNMAP opcode
always, regardless version of libiscsi.
It would mean that you need to build the CDB directly inside
block/iscsi.c but that is not hard.


I.e.  change block/iscsi.c to only use iscsi_scsi_command_[a]sync()
for all opcodes it wants to send to the target.
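
As an illustration of "building the CDB directly": a hedged sketch of
sending UNMAP through the generic interface. The CDB layout is per
SBC-3 (opcode 0x42, parameter list length in bytes 7..8); the helper
name is hypothetical and error handling is omitted:

    static struct scsi_task *send_unmap_generic(struct iscsi_context *iscsi,
                                                int lun, struct iscsi_data *param,
                                                iscsi_command_cb cb, void *opaque)
    {
        struct scsi_task *task = calloc(1, sizeof(struct scsi_task));

        task->cdb_size   = 10;
        task->xfer_dir   = SCSI_XFER_WRITE;
        task->expxferlen = param->size;
        task->cdb[0] = 0x42;                       /* UNMAP */
        task->cdb[7] = (param->size >> 8) & 0xff;  /* parameter list length */
        task->cdb[8] =  param->size       & 0xff;

        if (iscsi_scsi_command_async(iscsi, lun, task, cb, param, opaque) != 0) {
            free(task);
            return NULL;
        }
        return task;
    }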




On Wed, Nov 6, 2013 at 1:16 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 06/11/2013 20:38, Peter Lieven ha scritto:

 Am 06.11.2013 um 14:09 schrieb Paolo Bonzini pbonz...@redhat.com:

 Il 06/11/2013 14:08, Stefan Hajnoczi ha scritto:
 Thanks, applied to my block-next tree:
 https://github.com/stefanha/qemu/commits/block-next

 This will go into QEMU 1.8.  Since it's a new feature that touches core
 block layer code and several block drivers, I can't merge it into
 1.7-rc.

 Yay!  With these patches I can properly emulate WRITE SAME in the SCSI
 layer.

 Yes, you can, but only when write same is used to write zeroes. The other
 case is still unhandled or needs emulation.

 WRITE SAME with UNMAP is incorrect in the current QEMU code.  Your patch
 let me fix it so that the payload can be read back with no change, *and*
 sectors are unmapped when possible.

 Paolo




Re: [Qemu-devel] [PATCHv7 00/17] block: logical block provisioning enhancements

2013-11-08 Thread ronnie sahlberg
It would mean that any version of libiscsi from 1.1 or later would
work, and there would not be issues such as
UNMAP only being available from 1.2 onward, WRITESAME* only being
available from 1.3,
and SANITIZE only being available from 1.9 ...


On Fri, Nov 8, 2013 at 9:52 AM, ronnie sahlberg
ronniesahlb...@gmail.com wrote:
 For better support for older versions of libiscsi,
 I think it would be good to convert all the
 iscsi_unmap_task/iscsi_writesame*_task/iscsi_*_task functions to
 calls to the much more generic iscsi_scsi_command_sync().

 iscsi_scsi_command_sync() and iscsi_scsi_command_async() have been
 available since prior to version 1.1 and can be used to send arbitrary
 scsi opcodes to the target.


 iscsi_scsi_command_async() is already used instead of
 iscsi_read16/write16_async(), since the read16/write16 helpers were
 added to libiscsi at a much later stage, and there are examples of its
 use.


 Using iscsi_scsi_command_[a]sync() instead of for example
 iscsi_unmap_task() would mean that you can use the UNMAP opcode
 always, regardless version of libiscsi.
 It would mean that you need to build the CDB directly inside
 block/iscsi.c but that is not hard.


 I.e.  change block/iscsi.c to only use iscsi_scsi_command_[a]sync()
 for all opcodes it wants to send to the target.




 On Wed, Nov 6, 2013 at 1:16 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 06/11/2013 20:38, Peter Lieven ha scritto:

 Am 06.11.2013 um 14:09 schrieb Paolo Bonzini pbonz...@redhat.com:

 Il 06/11/2013 14:08, Stefan Hajnoczi ha scritto:
 Thanks, applied to my block-next tree:
 https://github.com/stefanha/qemu/commits/block-next

 This will go into QEMU 1.8.  Since it's a new feature that touches core
 block layer code and several block drivers, I can't merge it into
 1.7-rc.

 Yay!  With these patches I can properly emulate WRITE SAME in the SCSI
 layer.

 Yes, you can, but only when write same is used to write zeroes. The other
 case is still unhandled or needs emulation.

 WRITE SAME with UNMAP is incorrect in the current QEMU code.  Your patch
 let me fix it so that the payload can be read back with no change, *and*
 sectors are unmapped when possible.

 Paolo




Re: [Qemu-devel] [PATCH 4/4] qemu-img: conditionally discard target on convert

2013-07-19 Thread ronnie sahlberg
On Thu, Jul 18, 2013 at 11:08 PM, Peter Lieven p...@kamp.de wrote:
 On 19.07.2013 07:58, Paolo Bonzini wrote:

 Il 18/07/2013 21:28, Peter Lieven ha scritto:

 thanks for the details. I think to have optimal performance and best
 chance for unmapping in qemu-img convert
 it might be best to export the OPTIMAL UNMAP GRANULARITY

 Agreed about this.

 as well as the write_zeroes_w_discard capability via the BDI

 But why this?!?  It is _not_ needed.  All you need is to change the
 default of the -S option to be the OPTIMAL UNMAP GRANULARITY if it is
 nonzero.

 2 reasons:
 a) Does this guarantee that the requests are aligned to a multiple of the -S
 value?

 b) If this flag exists, qemu-img convert can do the zeroing a priori.
 This has the benefit that, combined with bdrv_get_block_status requests
 beforehand, it might not need to touch large areas of the volume.
 Speaking for iSCSI, it is likely that the user sets a fresh volume as
 the destination, but it is not guaranteed.
 With Patch 4 only a few get_block_status requests are done to verify
 this. If we just write zeroes with BDRV_MAY_UNMAP, we send hundreds or
 thousands of writesame requests for possibly already unmapped data.

 To give an example: if I take my storage with a 1TB volume, it takes
 about 10-12 get_block_status requests to verify that it is completely
 unmapped. After this I am safe to set has_zero_init = 1 in qemu-img
 convert.

 If I would convert a 1TB image to this target where, let's say, 50% are
 at least 15MB zero blocks (15MB is the OUG of my storage), it would take
 ~35000 write same requests to achieve the same.

I am not sure I am reading this right, but you don't have to writesame
exactly 1xOUG to get it to unmap; nxOUG will work too.
So instead of sending one writesame for each OUG range, you can send
one writesame for every ~10G or so.
10G is ~667 OUGs in your case, so you can send a
writesame for ~667xOUG in each command, and then it would only take
~100 writesames instead of ~35000.

So as long as you are sending in multiples of OUG you should be fine.
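
In code, picking such a chunk might look like this -- a sketch only;
oug and block_size are assumed to come from the Block Limits VPD page:

    /* send writesame in chunks of n x OUG, capped at roughly 10G */
    uint64_t cap_blocks = (10ULL << 30) / block_size;
    uint64_t chunk = (cap_blocks / oug) * oug;

    if (chunk == 0) {
        chunk = oug;    /* cap smaller than one granule */
    }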



 Peter


 Paolo

  and then zero
  out the whole device with bdrv_write_zeroes and the BDRV_MAY_DISCARD
 flag.

 the logic for this is already prepared in patch4 (topic of this email).
 except that
 i have to exchange bdrv_discard with bdrv_write_zeroes w/
 BDRV_MAY_DISCARD.

 what do you think?




 On Thu, Jul 18, 2013 at 11:43 AM, Peter Lieven p...@kamp.de wrote:

 Am 18.07.2013 um 16:35 schrieb Paolo Bonzini pbonz...@redhat.com:

 Il 18/07/2013 16:32, Peter Lieven ha scritto:

 (Mis)alignment and granularity can be handled later.  We can ignore
 them
 for now.  Later, if we decide the best way to support them is a
 flag,
 we'll add it.  Let's not put the cart before the horse.

 BTW, I expect alignment!=0 to be really, really rare.

 To explain my concerns:

 I know that my target has internal page size of 15MB. I will check
 what
 happens
 if I deallocate this 15MB in chunks of let's say 1MB. If the page gets
 unprovisioned
 after the last chunk is unmapped it would be fine :-)

 You're talking of granularity here, not (mis)alignment.

 you are right. for the target i am talking about this is 30720 512-byte
 blocks for the granularity (pagesize) and 0 for the alignment.
 i will see what happens if I write same w/unmap the whole 30720 blocks
 in smaller blocks ;-) otherwise i will have to add support for honoring
 these values in qemu-img convert
 as a follow up.

 Peter





 --

 Mit freundlichen Grüßen

 Peter Lieven

 ...

   KAMP Netzwerkdienste GmbH
   Vestische Str. 89-91 | 46117 Oberhausen
   Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
   p...@kamp.de | http://www.kamp.de

   Geschäftsführer: Heiner Lante | Michael Lante
   Amtsgericht Duisburg | HRB Nr. 12154
   USt-Id-Nr.: DE 120607556

 ...




Re: [Qemu-devel] [PATCH 4/4] qemu-img: conditionally discard target on convert

2013-07-19 Thread ronnie sahlberg
On Thu, Jul 18, 2013 at 11:43 AM, Peter Lieven p...@kamp.de wrote:

 Am 18.07.2013 um 16:35 schrieb Paolo Bonzini pbonz...@redhat.com:

 Il 18/07/2013 16:32, Peter Lieven ha scritto:

 (Mis)alignment and granularity can be handled later.  We can ignore them
 for now.  Later, if we decide the best way to support them is a flag,
 we'll add it.  Let's not put the cart before the horse.

 BTW, I expect alignment!=0 to be really, really rare.
 To explain my concerns:

 I know that my target has internal page size of 15MB. I will check what
 happens
 if I deallocate this 15MB in chunks of let's say 1MB. If the page gets
 unprovisioned
 after the last chunk is unmapped it would be fine :-)

 You're talking of granularity here, not (mis)alignment.

 you are right. for the target i am talking about this is 30720 512-byte 
 blocks for the granularity (pagesize) and 0 for the alignment.
 i will see what happens if I write same w/unmap the whole 30720 blocks in 
 smaller blocks ;-)

If you write in smaller than OUG chunks, whether or not the physical
block becomes unmapped is, I think, implementation defined. A target is
allowed to make this unmapped but not required to.

For example if you do two writesame16 for your 30720 chunk :
1, writesame16 lba:0 tl:30000
2, writesame16 lba:30000 tl:720

then

SBC 4.7.3.4
===
A WRITE SAME command shall not cause an LBA to become unmapped if
unmapping that LBA creates a case in which a subsequent read of that
unmapped LBA is able to return logical block data that differs from
the Data-Out Buffer for that WRITE SAME command. The protection
information returned by a read of an unmapped LBA is set to
FFFF_FFFF_FFFF_FFFFh (see 4.7.4.5).
===

during processing of the second writesame of the physical block, since
the physical block is now all zero once this writesame16 completes, a
target may trigger the whole physical block to become unmapped,
even though this specific writesame16 was only for a fraction of the
block.
It is allowed to unmap the block, but not required to.


or it could do it at a later time. For example, a target is allowed to
have, say, a background scan-and-unmap process that goes through the
disk and unmaps all all-zero physical blocks.
I don't know if your target would do this. Would be neat to add this to STGT.

SBC 4.7.3.5 :
===
If the LBPRZ bit in the READ CAPACITY (16) parameter data (see 5.16.2)
is set to one, and a mapped LBA
references a logical block that contains:
a) user data with all bits set to zero; and
b) protection information, if any, set to FFFF_FFFF_FFFF_FFFFh,
then the device server may transition that mapped LBA to anchored or
deallocated at any time.
===

I.e. a target is allowed at any time to automatically unmap physical
blocks as long as the block is all zero and lbprz is set.


 otherwise i will have to add support for honoring these values in qemu-img
 convert
 as a follow up.

 Peter




Re: [Qemu-devel] [PATCH 4/4] qemu-img: conditionally discard target on convert

2013-07-19 Thread ronnie sahlberg
On Fri, Jul 19, 2013 at 6:49 AM, Peter Lieven p...@kamp.de wrote:
 On 19.07.2013 15:25, ronnie sahlberg wrote:

 On Thu, Jul 18, 2013 at 11:08 PM, Peter Lieven p...@kamp.de wrote:

 On 19.07.2013 07:58, Paolo Bonzini wrote:

 Il 18/07/2013 21:28, Peter Lieven ha scritto:

 thanks for the details. I think to have optimal performance and best
 change for unmapping in qemu-img convert
 it might be best to export the OPTIMAL UNMAP GRANULARITY

 Agreed about this.

 as well as the write_zeroes_w_discard capability via the BDI

 But why this?!?  It is _not_ needed.  All you need is to change the
 default of the -S option to be the OPTIMAL UNMAP GRANULARITY if it is
 nonzero.

 2 reasons:
 a) Does this guarantee that the requests are aligned to multiple of the
 -S
 value?

 b) If I this flag exists qemu-img convert can do the zeroing a priori.
 This
 has the benefit that combined with bdrv_get_block_status requests before
 it
 might
 not need to touch large areas of the volume. Speaking for iSCSI its
 likely
 that
 the user sets a fresh volume as the destination, but its not guaranteed.
 With the Patch 4 there are only done a few get_block_status requests to
 verify
 this. If we just write zeroes with BDRV_MAY_UNMAP, we send hundreds or
 thousands of writesame requests for possibly already unmapped data.

 To give an example. If I take my storage with an 1TB volume. It takes
 about
 10-12
 get_block_status requests to verify that it is completely unmapped. After
 this
 I am safe to set has_zero_init = 1 in qemu-img convert.

 If I would convert a 1TB image to this target where lets say 50% are at
 leat
 15MB
 zero blocks (15MB is the OUG of my storage) it would take ~35000 write
 same
 requests to achieve the same.

 I am not sure I am reading this right, but you dont have to writesame
 exactly 1xOUG to get it to unmap.
 nxOUG will work too,
 So instead of sending one writesame for each OUG range, you can send
 one writesame for every ~10G or so.
 Say 10G is ~667 OUGs for your case, so you can send
 writesame for ~667xOUG in each command and then it would only take
 ~100 writesames instead of ~35000.

 So as long as you are sending in multiples of OUG you should be fine.

 do I not have to take care of max_ws_size?

Yes, you need to handle max_ws_size, but I would imagine that on most
targets max_ws_size > OUG.
I would be surprised if a target would set max_ws_size to just a single OUG.
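
Combining both limits, a sketch (variable names assumed, not from any
posted patch):

    /* respect MAXIMUM WRITE SAME LENGTH and stay a multiple of OUG */
    uint64_t chunk = MIN(max_ws_size, nb_blocks);

    if (oug && chunk >= oug) {
        chunk -= chunk % oug;    /* round down to n x OUG */
    }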



 Peter



Re: [Qemu-devel] [PATCH 4/4] qemu-img: conditionally discard target on convert

2013-07-18 Thread ronnie sahlberg
On Thu, Jul 18, 2013 at 6:29 AM, Peter Lieven p...@kamp.de wrote:
 On 18.07.2013 14:31, Paolo Bonzini wrote:

 Il 18/07/2013 13:04, Peter Lieven ha scritto:

 But if you set BDRV_DISCARD_WRITE_ZEROES, then you always need a
 fallback to bdrv_write_zeroes.  Why not just call bdrv_write_zeroes to
 begin with?  That's why extending bdrv_write_zeroes is preferable.

 In this case wo do not need a flag to the function at all. If the
 driver sets bdi-write_zeroes_w_discard = 1 then bdrv_write_zeroes
 can use bdrv_discard to write zeroes and the driver has to
 ensure that all is zero afterwards.

 Peter, you removed exactly the part of the email where I explained the
 wrong part of your reasoning:

 you cannot do that [discard in bdrv_write_zeroes] unconditionally.
 Some operations can use it, some cannot.  Think of SCSI disk
 emulation: it must not discard is WRITE SAME is sent without the
 UNMAP bit!

 If the driver would have a better method of writing zeroes than
 discard it simply should not set bdi-write_zeroes_w_discard = 1.

 If the driver had a better method of writing zeroes than discard, it
 simply should ignore the BDRV_MAY_UNMAP (or BDRV_MAY_DISCARD) flag in
 its bdrv_write_zeros implementation.

 ok, but this would require an individual patch in every driver, wouldn't it?
 i am ok with that.

 the BDRV_MAY_DISCARD flag is, in the end, a hint whether the driver can
 optimize writing zeroes with a discard or whether real zeroes should be
 written, e.g. to sanitize the device?

 talking for iscsi:
 bdrv->discard can remain to use UNMAP and silently fail if lbpu == 0.
 bdrv->write_zeroes will use writesame16 and set the unmap flag only if
 BDRV_MAY_DISCARD == 1 and BDRV_O_UNMAP == 1 and lbprz == 1.

When you use WRITESAME16 you can ignore the lbprz flag.
Just send a WRITESAME16 command with one block of data that is all set
to zero.
If the unmap flag is set, and if unmapped blocks read back the same as
the block you provided (all zero), then it will unmap those blocks,
where possible.
All other blocks, that can not be unmapped or where unmapped blocks
would not read back the same, will instead be overwritten with the
provided all-zero block.


 in case lbpws == 0 it will return -ENOTSUP.

 Correct?

 Peter




Re: [Qemu-devel] [PATCH 4/4] qemu-img: conditionally discard target on convert

2013-07-18 Thread ronnie sahlberg
BlockLimitsVPD/OptimalUnmapGranularity also applies to unmapping with
writesame16 :

An OPTIMAL UNMAP GRANULARITY field set to a non-zero value indicates
the optimal granularity in logical blocks
for unmap requests (e.g., an UNMAP command or a WRITE SAME (16)
command with the UNMAP bit set to
one). An unmap request with a number of logical blocks that is not a
multiple of this value may result in unmap
operations on fewer LBAs than requested. An OPTIMAL UNMAP GRANULARITY
field set to 0000_0000h indicates
that the device server does not report an optimal unmap granularity.

So if you send a writesame16+unmap-bit that honours the optimal
unmap granularity, you have a pretty good chance
that the blocks will be unmapped. If they are not, they will at least
be overwritten as zero.


On Thu, Jul 18, 2013 at 11:43 AM, Peter Lieven p...@kamp.de wrote:

 Am 18.07.2013 um 16:35 schrieb Paolo Bonzini pbonz...@redhat.com:

 Il 18/07/2013 16:32, Peter Lieven ha scritto:

 (Mis)alignment and granularity can be handled later.  We can ignore them
 for now.  Later, if we decide the best way to support them is a flag,
 we'll add it.  Let's not put the cart before the horse.

 BTW, I expect alignment!=0 to be really, really rare.
 To explain my concerns:

 I know that my target has internal page size of 15MB. I will check what
 happens
 if I deallocate this 15MB in chunks of let's say 1MB. If the page gets
 unprovisioned
 after the last chunk is unmapped it would be fine :-)

 You're talking of granularity here, not (mis)alignment.

 you are right. for the target i am talking about this is 30720 512-byte 
 blocks for the granularity (pagesize) and 0 for the alignment.
 i will see what happens if I write same w/unmap the whole 30720 blocks in 
 smaller blocks ;-) otherwise i will have to add support for honoring these
 values in qemu-img convert
 as a follow up.

 Peter




Re: [Qemu-devel] [PATCH 4/4] qemu-img: conditionally discard target on convert

2013-07-17 Thread ronnie sahlberg
On Wed, Jul 17, 2013 at 3:25 AM, Kevin Wolf kw...@redhat.com wrote:
 Am 17.07.2013 um 11:58 hat Paolo Bonzini geschrieben:
 Il 17/07/2013 10:46, Kevin Wolf ha scritto:
  Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
  if a destination has has_zero_init = 0, but it supports
  discard zeroes use discard to convert the target
  into an all zero device.
 
  Signed-off-by: Peter Lieven p...@kamp.de
 
  Wouldn't it be better to use bdrv_write_zeroes() and extend the
  implementation of that to use discard internally in those block drivers
  where it makes sense?
 
  Because here you're not really discarding (i.e. don't care about the
  sectors any more), but you want them to be zeroed.

 I thought the same yesterday when reviewing the series, but I'm not
 convinced.

 Discarding is not always the right way to write zeroes, because it can
 disrupt performance.  It may be fine when you are already going to write
 a sparse image (as is the case for qemu-img convert), but not in
 general.  So if you just used write_zeroes, it would have to fall under
 yet another -drive option (or an extension to -drive discard).

 Maybe we need a flag to bdrv_write_zeroes() that specifies whether to
 use discard or not, kind of like the unmap bit in WRITE SAME.

 On the other hand, I think you're right that this is really policy,
 and as such shouldn't be hardcoded based on our guesses, but be
 configurable.

 I think what Peter did is a good compromise in the end.

 At the very least it must become a separate function. img_convert() is
 already too big and too deeply nested.

 BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
 zeroes blocks, but is that true for unaligned operations?

 SBC-3 Rev. 35e, 5.16 READ CAPACITY (16) command:

 A logical block provisioning read zeros (LBPRZ) bit set to one
 indicates that, for read commands specifying an unmapped LBA (see
 4.7.4.5), the device server returns user data set to zero [...]

 So it depends on the block provisioning state of the LBA, not on the
 operations that were performed on it.

 5.28 UNMAP command:

 If the ANCHOR bit in the CDB is set to zero, and the logical unit is
 thin provisioned (see 4.7.3.3), then the logical block provisioning
 state for each specified LBA:

 a) should become deallocated;
 b) may become anchored; or
 c) may remain unchanged.

 So with UNMAP, I think you don't have any guarantees that the LBA
 becomes unmapped and therefore zeroed. It could just keep its current
 data. No matter whether your request was aligned or not.


If you check the LogicalBlockLimits VPD page it has :

---
The UNMAP GRANULARITY ALIGNMENT field indicates the LBA of the first
logical block to which the OPTIMAL UNMAP GRANULARITY field applies.
The unmap granularity alignment is used to calculate an optimal unmap
request starting LBA as follows:

optimal unmap request starting LBA =
    (n x optimal unmap granularity) + unmap granularity alignment

where:
n is zero or any positive integer value;
optimal unmap granularity is the value in the OPTIMAL UNMAP
GRANULARITY field; and
unmap granularity alignment is the value in the UNMAP GRANULARITY
ALIGNMENT field.
An unmap request with a starting LBA that is not optimal may result in
unmap operations on fewer LBAs than requested.
---

So if Peter checks OPTIMAL_UNMAP_GRANULARITY and
UNMAP_GRANULARITY_ALIGNMENT and makes sure that
the starting LBA is (n * OUG) + UGA and that the number of blocks
is a multiple of OUG, then you should be fine.
Those unmaps should not silently fail.
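
That alignment rule translates to something like the following sketch;
lba, nb_blocks, oug and uga are hypothetical names:

    /* optimal starting LBA = (n * oug) + uga, length a multiple of oug */
    uint64_t n     = (lba > uga) ? (lba - uga + oug - 1) / oug : 0;
    uint64_t start = n * oug + uga;
    uint64_t len   = (nb_blocks / oug) * oug;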


I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
optimal unmap request then the blocks will become unmapped and they
will
read back as 0.


(
Using this check should also make it safe with UNMAP for 4k block
devices that emulate 512b blocks.
Those devices should report OUG==8 accordingly so you don't need to
check what LogicalBlocksPerPhysicalBlockExponent is.
)

regards
ronnie sahlberg



Re: [Qemu-devel] [PATCH 4/4] qemu-img: conditionally discard target on convert

2013-07-17 Thread ronnie sahlberg
On Wed, Jul 17, 2013 at 9:27 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 17/07/2013 17:54, ronnie sahlberg ha scritto:
 I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
 optimal unmap request then the blocks will become unmapped and they
 will read back as 0.

 Yes, but it is not reasonable to assume that bdrv_discard will only
 receive optimal requests.  Thus, using WRITE SAME for LBPRZ=1, and not
 exposing LBPRZ=1 if LBPWS=0 (may also use LBPWS10 depending on the
 capacity), is the safer thing to do.

 Paolo

ACK.

WRITESAME10/16 with UNMAP flag set is probably best.



Re: [Qemu-devel] [PATCH 4/4] qemu-img: conditionally discard target on convert

2013-07-17 Thread ronnie sahlberg
On Wed, Jul 17, 2013 at 10:02 AM, Peter Lieven p...@kamp.de wrote:
 For disks we always use read/write16, so I think we should also use
 writesame16. Or not?

Sounds good.



 Von meinem iPhone gesendet

 Am 17.07.2013 um 18:31 schrieb ronnie sahlberg ronniesahlb...@gmail.com:

 On Wed, Jul 17, 2013 at 9:27 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 17/07/2013 17:54, ronnie sahlberg ha scritto:
 I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
 optimal unmap request then the blocks will become unmapped and they
 will read back as 0.

 Yes, but it is not reasonable to assume that bdrv_discard will only
 receive optimal requests.  Thus, using WRITE SAME for LBPRZ=1, and not
 exposing LBPRZ=1 if LBPWS=0 (may also use LBPWS10 depending on the
 capacity), is the safer thing to do.

 Paolo

 ACK.

 WRITESAME10/16 with UNMAP flag set is probably best.



Re: [Qemu-devel] [PATCHv2 02/11] iscsi: read unmap info from block limits vpd page

2013-07-06 Thread ronnie sahlberg
Right.

I don't see any problem with your patch.


On Sat, Jul 6, 2013 at 3:15 PM, Peter Lieven p...@kamp.de wrote:
 ok, to sum up you see no potential problem with my patch to optimize write 
 zeroes by
 unmap iff lbprz==1 and lbpme == 1 ?

ACK
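
In the driver, that agreed condition would boil down to something like
this sketch; iscsilun->lbprz is assumed to have been read from READ
CAPACITY (16):

    /* UNMAP only guarantees zeroes when the LUN is thin provisioned
     * (lbpme == 1) and reports lbprz == 1 */
    if (iscsilun->lbpme && iscsilun->lbprz) {
        /* safe to implement write_zeroes via UNMAP */
    }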



 the alternative would be to use writesame16 and send a zero block. this
 would allow an optimization also if lbprz == 0. in this case i would not
 set the unmap bit.

 peter


 Am 05.07.2013 um 09:11 schrieb ronnie sahlberg ronniesahlb...@gmail.com:

 The device MIGHT map or anchor the blocks after the unmap, but it may
 only do so if the blocks that become mapped are all zero.

 So I think you can safely assume that if lbprz==1 then it will always
 read back as zero no matter what happens internally in the target.

 Either the block becomes unmapped/deallocated and then it will read
 back as zero,
 or the block is automatically re-mapped to anchored/mapped again,
 but this can only happen if the mapped block is all zero, so once again
 it will still read back as all zero.

 ===

 4.7.3.5 Autonomous LBA transitions
 A device server may perform the following actions at any time:
 a) transition any deallocated LBA to mapped;
 b) transition any anchored LBA to mapped; or
 c) transition any deallocated LBA to anchored.
 If the LBPRZ bit in the READ CAPACITY (16) parameter data (see 5.16.2)
 is set to one, and a mapped LBA
 references a logical block that contains:
 a) user data with all bits set to zero; and
 Working Draft SCSI Block Commands – 3 (SBC-3)
 27T10/BSR INCITS 514 Revision 35d
 15 May 2013
 b) protection information, if any, set to ___h,
 then the device server may transition that mapped LBA to anchored or
 deallocated at any time.
 The logical block provisioning st

 On Thu, Jul 4, 2013 at 2:07 PM, Peter Lieven p...@kamp.de wrote:

 Am 04.07.2013 um 14:37 schrieb Paolo Bonzini pbonz...@redhat.com:

 Il 03/07/2013 23:23, Peter Lieven ha scritto:
 BDC is not used. I had an implementation that sent multiple descriptors
 out, but
 at least for my storage the maximum unmap count applies not to each
 descriptor, but to all
 together. So in this case we do not need the field at all. I forgot to
 remove it.

 discard and write_zeroes will both only send one request up to max_unmap 
 in size.

 apropos write_zeroes: do you know if UNMAP is guaranteed to unmap data if 
 lbprz == 1?

 Yes.  On the other hand note that WRITE_SAME should be guaranteed _not_
 to unmap if lbprz == 0 and you do WRITE_SAME with UNMAP and a zero
 payload, but I suspect there may be buggy targets here.

 I have read in the specs something that the target might unmap the blocks 
 or not touch them at all.
 Maybe you have more information.

 That's even true of UNMAP itself, actually. :)

 The storage can always upgrade a block from unmapped to anchored and
 from anchored to allocated, so UNMAP can be a no-op and still comply
 with the standard.

 My concern was: if I UNMAP a block and lbprz == 1, is it guaranteed that
 it reads as zero afterwards? Regardless of whether the target decides to
 upgrade the block or not to unmap the block?

 Peter





Re: [Qemu-devel] [PATCHv2 02/11] iscsi: read unmap info from block limits vpd page

2013-07-05 Thread ronnie sahlberg
The device MIGHT map or anchor the blocks after the unmap, but it may
only do so if the blocks that become mapped are all zero.

So I think you can safely assume that if lbprz==1 then it will always
read back as zero no matter what happens internally in the target.

Either the block becomes unmapped/deallocated and then it will read
back as zero,
or the block is automatically re-mapped to anchored/mapped again,
but this can only happen if the mapped block is all zero, so once again
it will still read back as all zero.

===

4.7.3.5 Autonomous LBA transitions
A device server may perform the following actions at any time:
a) transition any deallocated LBA to mapped;
b) transition any anchored LBA to mapped; or
c) transition any deallocated LBA to anchored.
If the LBPRZ bit in the READ CAPACITY (16) parameter data (see 5.16.2)
is set to one, and a mapped LBA
references a logical block that contains:
a) user data with all bits set to zero; and
b) protection information, if any, set to FFFF_FFFF_FFFF_FFFFh,
then the device server may transition that mapped LBA to anchored or
deallocated at any time.
The logical block provisioning st

On Thu, Jul 4, 2013 at 2:07 PM, Peter Lieven p...@kamp.de wrote:

 Am 04.07.2013 um 14:37 schrieb Paolo Bonzini pbonz...@redhat.com:

 Il 03/07/2013 23:23, Peter Lieven ha scritto:
 BDC is not used. I had an implementation that sent multiple descriptors
 out, but
 at least for my storage the maximum unmap count applies not to each
 descriptor, but to all
 together. So in this case we do not need the field at all. I forgot to
 remove it.

 discard and write_zeroes will both only send one request up to max_unmap in 
 size.

 apropos write_zeroes: do you know if UNMAP is guaranteed to unmap data if 
 lbprz == 1?

 Yes.  On the other hand note that WRITE_SAME should be guaranteed _not_
 to unmap if lbprz == 0 and you do WRITE_SAME with UNMAP and a zero
 payload, but I suspect there may be buggy targets here.

 I have read in the specs something that the target might unmap the blocks 
 or not touch them at all.
 Maybe you have more information.

 That's even true of UNMAP itself, actually. :)

 The storage can always upgrade a block from unmapped to anchored and
 from anchored to allocated, so UNMAP can be a no-op and still comply
 with the standard.

 My concern was: if I UNMAP a block and lbprz == 1, is it guaranteed that
 it reads as zero afterwards? Regardless of whether the target decides to
 upgrade the block or not to unmap the block?

 Peter




Re: [Qemu-devel] [PATCHv2 02/11] iscsi: read unmap info from block limits vpd page

2013-07-02 Thread ronnie sahlberg
max_unmap :

If the target does not return an explicit limit for max_unmap it will
return 0xffffffff, which means no limit.

I think you should add a check for max_unmap and clamp it down to
something sane.
Maybe a maximum of 128M ?

Same for bdc, that should also be checked and clamped down to sane values.
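
A sketch of such a clamp (the 128M figure from above; the helper name
is hypothetical):

    #define ISCSI_MAX_UNMAP_BYTES (128 * 1024 * 1024)

    static uint32_t clamp_max_unmap(uint32_t max_unmap, uint32_t block_size)
    {
        uint32_t limit = ISCSI_MAX_UNMAP_BYTES / block_size;

        /* 0xffffffff means the target reported no limit */
        if (max_unmap == 0xffffffff || max_unmap > limit) {
            return limit;
        }
        return max_unmap;
    }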


On Thu, Jun 27, 2013 at 11:11 PM, Peter Lieven p...@kamp.de wrote:
 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c |   80 -
  1 file changed, 56 insertions(+), 24 deletions(-)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index a38a1bf..2e2455d 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -54,6 +54,8 @@ typedef struct IscsiLun {
  uint8_t lbpu;
  uint8_t lbpws;
  uint8_t lbpws10;
 +uint32_t max_unmap;
 +uint32_t max_unmap_bdc;
  } IscsiLun;

  typedef struct IscsiAIOCB {
 @@ -1007,6 +1009,37 @@ static QemuOptsList runtime_opts = {
  },
  };

 +static struct scsi_task *iscsi_do_inquiry(struct iscsi_context *iscsi,
 +                                          int lun, int evpd, int pc) {
 +    int full_size;
 +    struct scsi_task *task = NULL;
 +    task = iscsi_inquiry_sync(iscsi, lun, evpd, pc, 64);
 +    if (task == NULL || task->status != SCSI_STATUS_GOOD) {
 +        goto fail;
 +    }
 +    full_size = scsi_datain_getfullsize(task);
 +    if (full_size > task->datain.size) {
 +        scsi_free_scsi_task(task);
 +
 +        /* we need more data for the full list */
 +        task = iscsi_inquiry_sync(iscsi, lun, evpd, pc, full_size);
 +        if (task == NULL || task->status != SCSI_STATUS_GOOD) {
 +            goto fail;
 +        }
 +    }
 +
 +    return task;
 +
 +fail:
 +    error_report("iSCSI: Inquiry command failed : %s",
 +                 iscsi_get_error(iscsi));
 +    if (task) {
 +        scsi_free_scsi_task(task);
 +        return NULL;
 +    }
 +    return NULL;
 +}
 +
  /*
   * We support iscsi url's on the form
   * iscsi://[username%password@]host[:port]/targetname/lun
 @@ -1139,33 +1172,12 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags)
 
      if (iscsilun->lbpme) {
          struct scsi_inquiry_logical_block_provisioning *inq_lbp;
 -        int full_size;
 -
 -        task = iscsi_inquiry_sync(iscsi, iscsilun->lun, 1,
 -                                  SCSI_INQUIRY_PAGECODE_LOGICAL_BLOCK_PROVISIONING,
 -                                  64);
 -        if (task == NULL || task->status != SCSI_STATUS_GOOD) {
 -            error_report("iSCSI: Inquiry command failed : %s",
 -                         iscsi_get_error(iscsilun->iscsi));
 +        task = iscsi_do_inquiry(iscsilun->iscsi, iscsilun->lun, 1,
 +                                SCSI_INQUIRY_PAGECODE_LOGICAL_BLOCK_PROVISIONING);
 +        if (task == NULL) {
              ret = -EINVAL;
              goto out;
          }
 -        full_size = scsi_datain_getfullsize(task);
 -        if (full_size > task->datain.size) {
 -            scsi_free_scsi_task(task);
 -
 -            /* we need more data for the full list */
 -            task = iscsi_inquiry_sync(iscsi, iscsilun->lun, 1,
 -                                      SCSI_INQUIRY_PAGECODE_LOGICAL_BLOCK_PROVISIONING,
 -                                      full_size);
 -            if (task == NULL || task->status != SCSI_STATUS_GOOD) {
 -                error_report("iSCSI: Inquiry command failed : %s",
 -                             iscsi_get_error(iscsilun->iscsi));
 -                ret = -EINVAL;
 -                goto out;
 -            }
 -        }
 -
          inq_lbp = scsi_datain_unmarshall(task);
          if (inq_lbp == NULL) {
              error_report("iSCSI: failed to unmarshall inquiry datain blob");
 @@ -1175,6 +1187,26 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags)
          iscsilun->lbpu = inq_lbp->lbpu;
          iscsilun->lbpws = inq_lbp->lbpws;
          iscsilun->lbpws10 = inq_lbp->lbpws10;
 +        scsi_free_scsi_task(task);
 +        task = NULL;
 +    }
 +
 +    if (iscsilun->lbpu) {
 +        struct scsi_inquiry_block_limits *inq_bl;
 +        task = iscsi_do_inquiry(iscsilun->iscsi, iscsilun->lun, 1,
 +                                SCSI_INQUIRY_PAGECODE_BLOCK_LIMITS);
 +        if (task == NULL) {
 +            ret = -EINVAL;
 +            goto out;
 +        }
 +        inq_bl = scsi_datain_unmarshall(task);
 +        if (inq_bl == NULL) {
 +            error_report("iSCSI: failed to unmarshall inquiry datain blob");
 +            ret = -EINVAL;
 +            goto out;
 +        }
 +        iscsilun->max_unmap = inq_bl->max_unmap;
 +        iscsilun->max_unmap_bdc = inq_bl->max_unmap_bdc;
      }

  #if defined(LIBISCSI_FEATURE_NOP_COUNTER)
 --
 1.7.9.5




[Qemu-devel] [PATCH] iSCSI fix crash when using virtio and libiscsi V3

2013-06-23 Thread Ronnie Sahlberg
List,

Please find a new version of the patch to fix the iSCSI crash when ioctl with 
iovector is sent.

Updated to fix the commit message as per lerseks suggestion.
Also added an explicit cast to suppress a compiler warning when we dont have 
iovector support available.





[Qemu-devel] [PATCH] Fix iSCSI crash on SG_IO with an iovector

2013-06-23 Thread Ronnie Sahlberg
Don't assume that SG_IO is always invoked with a simple buffer;
check the iovec_count, and if it is >= 1 then we need to pass an array
of iovectors to libiscsi instead of just a plain buffer.

Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
---
 block/iscsi.c |   56 +---
 1 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 0bbf0b1..dca38c4 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -651,6 +651,9 @@ iscsi_aio_ioctl_cb(struct iscsi_context *iscsi, int status,
 {
     IscsiAIOCB *acb = opaque;
 
+    g_free(acb->buf);
+    acb->buf = NULL;
+
     if (acb->canceled != 0) {
         return;
     }
@@ -727,14 +730,36 @@ static BlockDriverAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
     memcpy(&acb->task->cdb[0], acb->ioh->cmdp, acb->ioh->cmd_len);
     acb->task->expxferlen = acb->ioh->dxfer_len;
 
+    data.size = 0;
     if (acb->task->xfer_dir == SCSI_XFER_WRITE) {
-        data.data = acb->ioh->dxferp;
-        data.size = acb->ioh->dxfer_len;
+        if (acb->ioh->iovec_count == 0) {
+            data.data = acb->ioh->dxferp;
+            data.size = acb->ioh->dxfer_len;
+        } else {
+#if defined(LIBISCSI_FEATURE_IOVECTOR)
+            scsi_task_set_iov_out(acb->task,
+                                  (struct scsi_iovec *) acb->ioh->dxferp,
+                                  acb->ioh->iovec_count);
+#else
+            int i;
+            char *buf;
+            struct scsi_iovec *iov = (struct scsi_iovec *)acb->ioh->dxferp;
+
+            acb->buf = g_malloc(acb->ioh->dxfer_len);
+            buf = (char *)acb->buf;
+            for (i = 0; i < acb->ioh->iovec_count; i++) {
+                memcpy(buf, iov[i].iov_base, iov[i].iov_len);
+                buf += iov[i].iov_len;
+            }
+            data.data = acb->buf;
+            data.size = acb->ioh->dxfer_len;
+#endif
+        }
     }
+
     if (iscsi_scsi_command_async(iscsi, iscsilun->lun, acb->task,
                                  iscsi_aio_ioctl_cb,
-                                 (acb->task->xfer_dir == SCSI_XFER_WRITE) ?
-                                     &data : NULL,
+                                 (data.size > 0) ? &data : NULL,
                                  acb) != 0) {
         scsi_free_scsi_task(acb->task);
         qemu_aio_release(acb);
@@ -743,9 +768,26 @@ static BlockDriverAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
 
     /* tell libiscsi to read straight into the buffer we got from ioctl */
     if (acb->task->xfer_dir == SCSI_XFER_READ) {
-        scsi_task_add_data_in_buffer(acb->task,
-                                     acb->ioh->dxfer_len,
-                                     acb->ioh->dxferp);
+        if (acb->ioh->iovec_count == 0) {
+            scsi_task_add_data_in_buffer(acb->task,
+                                         acb->ioh->dxfer_len,
+                                         acb->ioh->dxferp);
+        } else {
+#if defined(LIBISCSI_FEATURE_IOVECTOR)
+            scsi_task_set_iov_in(acb->task,
+                                 (struct scsi_iovec *) acb->ioh->dxferp,
+                                 acb->ioh->iovec_count);
+#else
+            int i;
+            for (i = 0; i < acb->ioh->iovec_count; i++) {
+                struct scsi_iovec *iov = (struct scsi_iovec *)acb->ioh->dxferp;
+
+                scsi_task_add_data_in_buffer(acb->task,
+                                             iov[i].iov_len,
+                                             iov[i].iov_base);
+            }
+#endif
+        }
     }
 
     iscsi_set_events(iscsilun);
-- 
1.7.3.1




Re: [Qemu-devel] [PATCH] iSCSI fix crash when using virtio and libiscsi

2013-06-21 Thread ronnie sahlberg
I can add the checks and resubmit.

On Fri, Jun 21, 2013 at 12:38 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 21/06/2013 04:32, Ronnie Sahlberg ha scritto:
 Stefan, List

 Please find a patch that fixes the crashes for using virtio with libiscsi.
 The problem was that block/iscsi.c always assumed we got a plain buffer to 
 read data into, and when we got an iovector array instead we would overwrite 
 pointers with garbage and crash.

 Since we can get iovectors for the write case as well I have added a fix for 
 when the guest is writing data to the target to handle the iovector case as 
 well.


 The new calls added are not protected with (LIBISCSI_FEATURE_IOVECTOR) checks
 since anyone building a new/current version of qemu should probably also 
 build
 against a current libiscsi.

 Not necessarily, you may build against an older libiscsi from the distro.

 Can you resubmit with the checks intact?

 Paolo

 I will send patches later to remove the current (LIBISCSI_FEATURE_IOVECTOR) 
 checks in the rest of the file.




Re: [Qemu-devel] [PATCH 1/2] iscsi: add support for bdrv_co_is_allocated()

2013-06-21 Thread ronnie sahlberg
Should we really mix co-routines and AIO in the same backend?

Would it not be better to instead add a new bdrv_aio_is_allocated and
use non-blocking async calls to libiscsi ?


On Fri, Jun 21, 2013 at 2:18 AM, Kevin Wolf kw...@redhat.com wrote:
 Am 20.06.2013 um 20:20 hat Peter Lieven geschrieben:
 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c |   57 +
  1 file changed, 57 insertions(+)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index 0bbf0b1..e6b966d 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -49,6 +49,7 @@ typedef struct IscsiLun {
  uint64_t num_blocks;
  int events;
  QEMUTimer *nop_timer;
 +uint8_t lbpme;
  } IscsiLun;

  typedef struct IscsiAIOCB {
 @@ -800,6 +801,60 @@ iscsi_getlength(BlockDriverState *bs)
  return len;
  }

 +static int coroutine_fn iscsi_co_is_allocated(BlockDriverState *bs,
 +                                              int64_t sector_num,
 +                                              int nb_sectors, int *pnum)
 +{
 +    IscsiLun *iscsilun = bs->opaque;
 +    struct scsi_task *task = NULL;
 +    struct scsi_get_lba_status *lbas = NULL;
 +    struct scsi_lba_status_descriptor *lbasd = NULL;
 +    int ret;
 +
 +    *pnum = nb_sectors;
 +
 +    if (iscsilun->lbpme == 0) {
 +        return 1;
 +    }
 +
 +    /* in-flight requests could invalidate the lba status result */
 +    while (iscsi_process_flush(iscsilun)) {
 +        qemu_aio_wait();
 +    }

 Note that you're blocking here. The preferred way would be something
 involving a yield from the coroutine and a reenter as soon as all
 requests are done. Maybe a CoRwLock does what you need?

 +
 +    task = iscsi_get_lba_status_sync(iscsilun->iscsi, iscsilun->lun,
 +                                     sector_qemu2lun(sector_num, iscsilun),
 +                                     8+16);

 Spacing around operators (8 + 16)

 +
 +    if (task == NULL || task->status != SCSI_STATUS_GOOD) {
 +        scsi_free_scsi_task(task);
 +        return 1;

 Error cases should set *pnum = 0 and return 0.

 Kevin



Re: [Qemu-devel] [PATCH 1/2] iscsi: add support for bdrv_co_is_allocated()

2013-06-21 Thread ronnie sahlberg
On Fri, Jun 21, 2013 at 10:06 AM, Peter Lieven p...@kamp.de wrote:
 Am 21.06.2013 18:31, schrieb Paolo Bonzini:
 Il 21/06/2013 13:07, Kevin Wolf ha scritto:
 Note that you're blocking here. The preferred way would be something
 involving a yield from the coroutine and a reenter as soon as all
 requests are done. Maybe a CoRwLock does what you need?
 Is there a document how to use it? Or can you help here?
 The idea would be to take a read lock while any request is in flight
 (i.e. qemu_co_rwlock_rdlock() before it's started and
 qemu_co_rwlock_unlock() when it completes), and to take a write lock
 (qemu_co_rwlock_wrlock) for the part of iscsi_co_is_allocated() that
 requires that no other request runs in parallel.
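
Spelled out, that locking scheme looks roughly like this -- a sketch,
assuming QEMU's CoRwlock API:

    CoRwlock flush_lock;   /* qemu_co_rwlock_init(&flush_lock) at open */

    /* around every in-flight read/write request: */
    qemu_co_rwlock_rdlock(&flush_lock);
    /* ... issue the request, yield until it completes ... */
    qemu_co_rwlock_unlock(&flush_lock);

    /* in iscsi_co_is_allocated, where no request may run in parallel: */
    qemu_co_rwlock_wrlock(&flush_lock);
    /* ... GET LBA STATUS ... */
    qemu_co_rwlock_unlock(&flush_lock);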

 You can just send the SCSI command asynchronously and wait for the
 result.  There is an example in block/qed.c, the same would apply for iscsi.

 thanks for the pointer paolo, this was i was looking for. this here seems to 
 work:

static void
iscsi_co_is_allocated_cb(struct iscsi_context *iscsi, int status,
                         void *command_data, void *opaque)
{
    struct IscsiTask *iTask = opaque;
    struct scsi_task *task = command_data;
    struct scsi_get_lba_status *lbas = NULL;

    iTask->complete = 1;

    if (status != 0) {
        error_report("iSCSI: Failed to get_lba_status on iSCSI lun. %s",
                     iscsi_get_error(iscsi));
        iTask->status = 1;
        goto out;
    }

    lbas = scsi_datain_unmarshall(task);
    if (lbas == NULL) {
        iTask->status = 1;
        goto out;
    }

    memcpy(&iTask->lbasd, &lbas->descriptors[0],
           sizeof(struct scsi_lba_status_descriptor));

Only the first descriptor?
The range sector_num .. sector_num+nb_sectors could be partially
allocated, in which case you get multiple descriptors.



    iTask->status = 0;

out:
    scsi_free_scsi_task(task);

    if (iTask->co) {
        qemu_coroutine_enter(iTask->co, NULL);
    }
}

static int coroutine_fn iscsi_co_is_allocated(BlockDriverState *bs,
                                              int64_t sector_num,
                                              int nb_sectors, int *pnum)
{
    IscsiLun *iscsilun = bs->opaque;
    struct IscsiTask iTask;
    int ret;

    *pnum = nb_sectors;

    if (iscsilun->lbpme == 0) {
        return 1;
    }

    iTask.iscsilun = iscsilun;
    iTask.status = 0;
    iTask.complete = 0;
    iTask.bs = bs;

    if (iscsi_get_lba_status_task(iscsilun->iscsi, iscsilun->lun,
                                  sector_qemu2lun(sector_num, iscsilun),
                                  8 + 16, iscsi_co_is_allocated_cb,
                                  &iTask) == NULL) {
        *pnum = 0;
        return 0;
    }

    while (!iTask.complete) {
        iscsi_set_events(iscsilun);
        iTask.co = qemu_coroutine_self();
        qemu_coroutine_yield();
    }

    if (iTask.status != 0) {
        /* in case the get_lba_status callout fails (i.e.
         * because the device is busy or the cmd is not
         * supported) we pretend all blocks are allocated
         * for backwards compatibility */
        return 1;
    }

    if (sector_qemu2lun(sector_num, iscsilun) != iTask.lbasd.lba) {
        *pnum = 0;
        return 0;
    }

    *pnum = iTask.lbasd.num_blocks * (iscsilun->block_size / BDRV_SECTOR_SIZE);
    if (*pnum > nb_sectors) {
        *pnum = nb_sectors;
    }

    return (iTask.lbasd.provisioning == SCSI_PROVISIONING_TYPE_MAPPED) ? 1 : 0;

    return ret;
}
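
To address the multiple-descriptor point raised above, the callback
could walk all returned descriptors instead of copying only the first.
A sketch, assuming libiscsi's scsi_get_lba_status exposes a
num_descriptors field (verify against your libiscsi headers):

    uint64_t lba    = sector_qemu2lun(sector_num, iscsilun);
    int      prov   = lbas->descriptors[0].provisioning;
    uint64_t blocks = 0;
    int      i;

    /* extend the run across consecutive descriptors with the same
     * provisioning status; stop at a gap or a status change */
    for (i = 0; i < lbas->num_descriptors; i++) {
        struct scsi_lba_status_descriptor *d = &lbas->descriptors[i];

        if (d->lba != lba + blocks || d->provisioning != prov) {
            break;
        }
        blocks += d->num_blocks;
    }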



[Qemu-devel] [PATCH] iSCSI fix crash when using virtio and libiscsi V2

2013-06-21 Thread Ronnie Sahlberg
List,

Version 2 of patch to fix the crashbug for virtio and libiscsi

Make block/iscsi.c aware that ioh might not contain a plain buffer
for the data but an array of io vectors.



Updated to check LIBISCSI_FEATURE_IOVECTOR to determine whether we can
use iovectors or not.
If not, we fall back to either lots of datain buffers or a serialization
dataout buffer.

regards
ronnie sahlberg




[Qemu-devel] [PATCH] Fix iSCSI crash on SG_IO with an iovector

2013-06-21 Thread Ronnie Sahlberg
Don't assume that SG_IO is always invoked with a simple buffer;
check the iovec_count, and if it is > 1 then we need to pass an array
of iovectors to libiscsi instead of just a plain buffer.

Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
---
 block/iscsi.c |   56 +---
 1 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 0bbf0b1..cbe2e8f 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -651,6 +651,9 @@ iscsi_aio_ioctl_cb(struct iscsi_context *iscsi, int status,
 {
     IscsiAIOCB *acb = opaque;
 
+    g_free(acb->buf);
+    acb->buf = NULL;
+
     if (acb->canceled != 0) {
         return;
     }
@@ -727,14 +730,36 @@ static BlockDriverAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
     memcpy(&acb->task->cdb[0], acb->ioh->cmdp, acb->ioh->cmd_len);
     acb->task->expxferlen = acb->ioh->dxfer_len;
 
+    data.size = 0;
     if (acb->task->xfer_dir == SCSI_XFER_WRITE) {
-        data.data = acb->ioh->dxferp;
-        data.size = acb->ioh->dxfer_len;
+        if (acb->ioh->iovec_count == 0) {
+            data.data = acb->ioh->dxferp;
+            data.size = acb->ioh->dxfer_len;
+        } else {
+#if defined(LIBISCSI_FEATURE_IOVECTOR)
+            scsi_task_set_iov_out(acb->task,
+                                  (struct scsi_iovec *) acb->ioh->dxferp,
+                                  acb->ioh->iovec_count);
+#else
+            int i;
+            char *buf;
+            struct scsi_iovec *iov = (struct scsi_iovec *)acb->ioh->dxferp;
+
+            acb->buf = g_malloc(acb->ioh->dxfer_len);
+            buf = acb->buf;
+            for (i = 0; i < acb->ioh->iovec_count; i++) {
+                memcpy(buf, iov[i].iov_base, iov[i].iov_len);
+                buf += iov[i].iov_len;
+            }
+            data.data = acb->buf;
+            data.size = acb->ioh->dxfer_len;
+#endif
+        }
     }
+
     if (iscsi_scsi_command_async(iscsi, iscsilun->lun, acb->task,
                                  iscsi_aio_ioctl_cb,
-                                 (acb->task->xfer_dir == SCSI_XFER_WRITE) ?
-                                     &data : NULL,
+                                 (data.size > 0) ? &data : NULL,
                                  acb) != 0) {
         scsi_free_scsi_task(acb->task);
         qemu_aio_release(acb);
@@ -743,9 +768,26 @@ static BlockDriverAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
 
     /* tell libiscsi to read straight into the buffer we got from ioctl */
     if (acb->task->xfer_dir == SCSI_XFER_READ) {
-        scsi_task_add_data_in_buffer(acb->task,
-                                     acb->ioh->dxfer_len,
-                                     acb->ioh->dxferp);
+        if (acb->ioh->iovec_count == 0) {
+            scsi_task_add_data_in_buffer(acb->task,
+                                         acb->ioh->dxfer_len,
+                                         acb->ioh->dxferp);
+        } else {
+#if defined(LIBISCSI_FEATURE_IOVECTOR)
+            scsi_task_set_iov_in(acb->task,
+                                 (struct scsi_iovec *) acb->ioh->dxferp,
+                                 acb->ioh->iovec_count);
+#else
+            int i;
+            for (i = 0; i < acb->ioh->iovec_count; i++) {
+                struct scsi_iovec *iov = (struct scsi_iovec *)acb->ioh->dxferp;
+
+                scsi_task_add_data_in_buffer(acb->task,
+                                             iov[i].iov_len,
+                                             iov[i].iov_base);
+            }
+#endif
+        }
     }
 
     iscsi_set_events(iscsilun);
-- 
1.7.3.1




Re: [Qemu-devel] [Bug 1191606] Re: qemu crashes with iscsi initiator (libiscsi) when using virtio

2013-06-20 Thread ronnie sahlberg
http://pastebin.com/EuwZPna1

Last few thousand lines from the log with your patch.


The crash happens immediately after qemu has called out to iscsi_ioctl
with SG_IO to read the serial numbers vpd page.
We get the reply back from the target, but as soon as ioctl_cb returns we
crash.
If you comment out SG_IO in iscsi_ioctl then the crash does not happen
(but qemu does not get the serial number either).


I will look more into it tonight.


On Wed, Jun 19, 2013 at 2:17 AM, Laszlo Ersek ler...@redhat.com wrote:
 On 06/19/13 06:34, ronnie sahlberg wrote:
 I can reproduce with current QEMU.

 Ubuntu 13 crashes with if=virtio but if=ide is fine.


 But it seems dependent on the guest/kernel.

 For example Fedora-18-x86_64-Live-Desktop.iso  installs and runs just
 fine, even with virtio
 But both ubuntu-12.04-desktop-amd64.iso or
 ubuntu-13.04-desktop-amd64.iso crash with if=virtio


 Stack backtrace I got is
 #0  0x7f7a9e22d037 in __GI_raise (sig=sig@entry=6)
 at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
 #1  0x7f7a9e230698 in __GI_abort () at abort.c:90
 #2  0x7f7aa0a93ec8 in qemu_ram_addr_from_host_nofail (
 ptr=ptr@entry=0x2020202024008000) at /DATA/SRC/qemu-kvm/qemu/exec.c:1399
 #3  0x7f7aa0a94a50 in address_space_unmap (as=<optimised out>,
 buffer=0x2020202024008000, len=<optimised out>, is_write=<optimised out>,
 access_len=1) at /DATA/SRC/qemu-kvm/qemu/exec.c:2155
 #4  0x7f7aa0a94bef in cpu_physical_memory_unmap (buffer=<optimised out>,
 len=<optimised out>, is_write=<optimised out>, access_len=<optimised out>)
 at /DATA/SRC/qemu-kvm/qemu/exec.c:2189
 #5  0x7f7aa0ad7867 in virtqueue_fill (vq=vq@entry=0x7f7aa34277f0,
 elem=elem@entry=0x7f7aa37ca328, len=1, idx=idx@entry=0)
 at /DATA/SRC/qemu-kvm/qemu/hw/virtio/virtio.c:243
 #6  0x7f7aa0ad79cf in virtqueue_push (vq=0x7f7aa34277f0,
 elem=elem@entry=0x7f7aa37ca328, len=<optimised out>)
 at /DATA/SRC/qemu-kvm/qemu/hw/virtio/virtio.c:279
 #7  0x7f7aa0aa9989 in virtio_blk_req_complete (
 req=req@entry=0x7f7aa37ca320, status=status@entry=0)
 at /DATA/SRC/qemu-kvm/qemu/hw/block/virtio-blk.c:49
 #8  0x7f7aa0aa9ffb in virtio_blk_handle_request (
 req=req@entry=0x7f7aa37ca320, mrb=mrb@entry=0x7fff7a7b2060)
 at /DATA/SRC/qemu-kvm/qemu/hw/block/virtio-blk.c:376

 Can you try the attached patch? It will produce quite a bit of output on
 stderr.

 Thanks
 Laszlo



Re: [Qemu-devel] [Bug 1191606] Re: qemu crashes with iscsi initiator (libiscsi) when using virtio

2013-06-20 Thread ronnie sahlberg
On Thu, Jun 20, 2013 at 7:47 AM, Laszlo Ersek ler...@redhat.com wrote:
 On 06/20/13 15:33, ronnie sahlberg wrote:
 http://pastebin.com/EuwZPna1

 Last few thousand lines from the log with your patch.


 The crash happens immediately after qemu has called out to iscsi_ioctl
 with SG_IO to read the serial numbers vpd page.
 We get the reply back from the target but as soon as ioctl_cb returns we 
 crash.
 If you comment out SG_IO in iscsi_ioctl then the crash does not happen
 (but then qemu does not get the serial number either).


 I will look more into it tonight.

   virtqueue_map_sg: mapped gpa=790a9000 at hva=0x7f0cb10a9000 for 
 length=4, is_write=1  (out: data)
   virtqueue_map_sg: mapped gpa=7726fc70 at hva=0x7f0caf26fc70 for 
 length=96, is_write=1 (out: sense)
   virtqueue_map_sg: mapped gpa=764e5aa0 at hva=0x7f0cae4e5aa0 for 
 length=16, is_write=1 (out: errors, data_len, sense_len, residual)
   virtqueue_map_sg: mapped gpa=764e5adc at hva=0x7f0cae4e5adc for 
 length=1, is_write=1  (out: status)
   virtqueue_map_sg: mapped gpa=764e5a90 at hva=0x7f0cae4e5a90 for 
 length=16, is_write=0 (in: type, ioprio, sector)
   virtqueue_map_sg: mapped gpa=7ab80578 at hva=0x7f0cb2b80578 for 
 length=6, is_write=0  (in: cmd)
   virtio_blk_handle_request: type=0x0002
   virtqueue_fill: unmapping hva=0x7f0c24008000 for length=4, access_len=1, 
 is_write=1
   Bad ram pointer 0x7f0c24008000

 This looks related, in virtio_blk_handle_scsi():

 } else if (req->elem.in_num > 3) {
 /*
  * If we have more than 3 input segments the guest wants to actually
  * read data.
  */
 hdr.dxfer_direction = SG_DXFER_FROM_DEV;
 hdr.iovec_count = req->elem.in_num - 3;
 for (i = 0; i < hdr.iovec_count; i++)
 hdr.dxfer_len += req->elem.in_sg[i].iov_len;

 hdr.dxferp = req->elem.in_sg;
 } else {

 This sets
 - hdr.iovec_count to 1,
 - hdr.dxfer_len to 4,
 - hdr.dxferp as shown above,

 For struct sg_io_hdr (which is the type of hdr), the field documentation
 is in include/scsi/sg.h:

 unsigned short iovec_count; /* [i] 0 implies no scatter gather */

 void __user *dxferp;/* [i], [*io] points to data transfer memory
   or scatter gather list */

 Now what we're seeing is a corruption of req->elem.in_sg[0].iov_base,
 whose address equals that of req->elem.in_sg (it's at offset 0 in the
 struct at subscript #0 in the array).

   virtqueue_map_sg: mapped gpa=790a9000 at hva=0x7f0cb10a9000 for 
 length=4, is_write=1
   [...]
   virtio_blk_handle_request: type=0x0002
   virtqueue_fill: unmapping hva=0x7f0c24008000 for length=4, access_len=1, 
 is_write=1
   Bad ram pointer 0x7f0c24008000

 First I don't understand how access_len can only be 1. But, in any
 case, if the req->elem.in_sg[0].iov_base pointer is stored in
 little-endian order, and the kernel (or iscsi_scsi_command_async()?) for
 whatever reason misinterprets hdr.dxferp to point at an actual receive
 buffer (instead of an iovec array), that would be consistent with the
 symptoms:

Ah, that makes sense.

block/iscsi.c   (https://github.com/qemu/qemu/blob/master/block/iscsi.c)
does assume that ioh->dxferp is a pointer to the buffer and that there
is no scatter/gather.
See lines 745-749.

I did not know that ioctl() could take a scatter/gather list.


I can't test now, but if I understand right then
lines 745-749 should be replaced with something that does:

* check ioh->iovec_count. IF it is zero then there is no scatter/gather
and ioh->dxferp points to a buffer, so just do what we do today.
* IF iovec_count is > 0 then dxferp is NOT a pointer to a buffer but
a pointer to an array of iovecs; then
traverse the iovec array and add these as buffers to the task just
like we do for readv. For example, similar to the loop that adds the
iovecs in lines 449-453.


I will try this tonight.



   0x7f0cb10a9000 --- original value of req->elem.in_sg[0].iov_base
   0x7f0c24008000 --- corrupted value
                  --- 4 low bytes overwritten by SCSI data

 Laszlo



[Qemu-devel] [PATCH] iSCSI fix crash when using virtio and libiscsi

2013-06-20 Thread Ronnie Sahlberg
Stefan, List

Please find a patch that fixes the crashes for using virtio with libiscsi.
The problem was that block/iscsi.c always assumed we got a plain buffer to read 
data into, and when we got an iovector array instead we would overwrite 
pointers with garbage and crash.

Since we can get iovectors for the write case as well I have added a fix for 
when the guest is writing data to the target to handle the iovector case as 
well.


The new calls added are not protected with (LIBISCSI_FEATURE_IOVECTOR) checks
since anyone building a new/current version of qemu should probably also build
against a current libiscsi.
I will send patches later to remove the current (LIBISCSI_FEATURE_IOVECTOR) 
checks in the rest of the file.


regards
ronnie sahlberg





[Qemu-devel] [PATCH] Fix iSCSI crash on SG_IO with an iovector

2013-06-20 Thread Ronnie Sahlberg
Don't assume that SG_IO is always invoked with a simple buffer,
check the iovec_count and if it is >= 1 then we need to pass an array
of iovectors to libiscsi instead of just a plain buffer.

Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
---
 block/iscsi.c |   31 ++++++++++++++++++++++++-------
 1 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 0bbf0b1..2d1cb4e 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -727,25 +727,42 @@ static BlockDriverAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
     memcpy(&acb->task->cdb[0], acb->ioh->cmdp, acb->ioh->cmd_len);
     acb->task->expxferlen = acb->ioh->dxfer_len;
 
+    data.size = 0;
     if (acb->task->xfer_dir == SCSI_XFER_WRITE) {
-        data.data = acb->ioh->dxferp;
-        data.size = acb->ioh->dxfer_len;
+        if (acb->ioh->iovec_count == 0) {
+            data.data = acb->ioh->dxferp;
+            data.size = acb->ioh->dxfer_len;
+        }
     }
     if (iscsi_scsi_command_async(iscsi, iscsilun->lun, acb->task,
                                  iscsi_aio_ioctl_cb,
-                                 (acb->task->xfer_dir == SCSI_XFER_WRITE) ?
-                                     &data : NULL,
+                                 (data.size > 0) ? &data : NULL,
                                  acb) != 0) {
         scsi_free_scsi_task(acb->task);
         qemu_aio_release(acb);
         return NULL;
     }
 
+    /* We got an iovector for writing to the target */
+    if (acb->task->xfer_dir == SCSI_XFER_WRITE) {
+        if (acb->ioh->iovec_count > 0) {
+            scsi_task_set_iov_out(acb->task,
+                                  (struct scsi_iovec *) acb->ioh->dxferp,
+                                  acb->ioh->iovec_count);
+        }
+    }
+
     /* tell libiscsi to read straight into the buffer we got from ioctl */
     if (acb->task->xfer_dir == SCSI_XFER_READ) {
-        scsi_task_add_data_in_buffer(acb->task,
-                                     acb->ioh->dxfer_len,
-                                     acb->ioh->dxferp);
+        if (acb->ioh->iovec_count == 0) {
+            scsi_task_add_data_in_buffer(acb->task,
+                                         acb->ioh->dxfer_len,
+                                         acb->ioh->dxferp);
+        } else {
+            scsi_task_set_iov_in(acb->task,
+                                 (struct scsi_iovec *) acb->ioh->dxferp,
+                                 acb->ioh->iovec_count);
+        }
     }
 
     iscsi_set_events(iscsilun);
-- 
1.7.3.1




Re: [Qemu-devel] [Bug 1191606] Re: qemu crashes with iscsi initiator (libiscsi) when using virtio

2013-06-18 Thread ronnie sahlberg
I can reproduce with current QEMU.

Ubuntu 13 crashes with if=virtio but if=ide is fine.


But it seems dependent on the guest/kernel.

For example Fedora-18-x86_64-Live-Desktop.iso  installs and runs just
fine, even with virtio
But both ubuntu-12.04-desktop-amd64.iso or
ubuntu-13.04-desktop-amd64.iso crash with if=virtio


Stack backtrace I got is
#0  0x7f7a9e22d037 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f7a9e230698 in __GI_abort () at abort.c:90
#2  0x7f7aa0a93ec8 in qemu_ram_addr_from_host_nofail (
ptr=ptr@entry=0x2020202024008000) at /DATA/SRC/qemu-kvm/qemu/exec.c:1399
#3  0x7f7aa0a94a50 in address_space_unmap (as=<optimised out>,
buffer=0x2020202024008000, len=<optimised out>, is_write=<optimised out>,
access_len=1) at /DATA/SRC/qemu-kvm/qemu/exec.c:2155
#4  0x7f7aa0a94bef in cpu_physical_memory_unmap (buffer=<optimised out>,
len=<optimised out>, is_write=<optimised out>, access_len=<optimised out>)
at /DATA/SRC/qemu-kvm/qemu/exec.c:2189
#5  0x7f7aa0ad7867 in virtqueue_fill (vq=vq@entry=0x7f7aa34277f0,
elem=elem@entry=0x7f7aa37ca328, len=1, idx=idx@entry=0)
at /DATA/SRC/qemu-kvm/qemu/hw/virtio/virtio.c:243
#6  0x7f7aa0ad79cf in virtqueue_push (vq=0x7f7aa34277f0,
elem=elem@entry=0x7f7aa37ca328, len=<optimised out>)
at /DATA/SRC/qemu-kvm/qemu/hw/virtio/virtio.c:279
#7  0x7f7aa0aa9989 in virtio_blk_req_complete (
req=req@entry=0x7f7aa37ca320, status=status@entry=0)
at /DATA/SRC/qemu-kvm/qemu/hw/block/virtio-blk.c:49
#8  0x7f7aa0aa9ffb in virtio_blk_handle_request (
req=req@entry=0x7f7aa37ca320, mrb=mrb@entry=0x7fff7a7b2060)
at /DATA/SRC/qemu-kvm/qemu/hw/block/virtio-blk.c:376
---Type <return> to continue, or q <return> to quit---
#9  0x7f7aa0aaa625 in virtio_blk_handle_output (vdev=<optimised out>,
vq=<optimised out>) at /DATA/SRC/qemu-kvm/qemu/hw/block/virtio-blk.c:412
#10 0x7f7aa0a17c0e in qemu_iohandler_poll (pollfds=0x7f7aa335f800,
ret=ret@entry=1) at iohandler.c:143
#11 0x7f7aa0a181a6 in main_loop_wait (nonblocking=<optimised out>)
at main-loop.c:466
#12 0x7f7aa08f6fb9 in main_loop () at vl.c:2028
#13 main (argc=<optimised out>, argv=<optimised out>, envp=<optimised out>)
at vl.c:4425



On Tue, Jun 18, 2013 at 12:36 PM, Laszlo Ersek ler...@redhat.com wrote:
 On 06/18/13 20:01, Klaus Hochlehnert wrote:
 I'll see what I can do to recompile qemu with debugging information.
 Maybe tomorrow.

 But one other question. I thought this is the normal qemu bug
 reporting, or is it Ubuntu only? I tried with the latest release and
 followed the "Report a bug" link from the qemu web site.

 The tracker on launchpad is for upstream bugs, AFAIK. I only referred to
 Ubuntu because that was your host OS and because for a while I wasn't
 aware that your qemu version was independent from your host OS.

 The main thing is the debug symbols. Since for a while I was assuming
 that you had run into the abort() with your distro's qemu package, I
 tried to help with your distro's debug symbols for qemu.

 Thanks
 Laszlo




Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O

2013-05-29 Thread ronnie sahlberg
Julian,

Stefan's concerns are valid.

(Hopefully, kernel is harder to exploit and more carefully audited.)




On Wed, May 29, 2013 at 9:02 AM, Julian Stecklina
jstec...@os.inf.tu-dresden.de wrote:
 On 05/29/2013 04:21 PM, Stefan Hajnoczi wrote:
 The fact that a single switch process has shared memory access to all
 guests' RAM is critical.  If the switch process is exploited, then that
 exposes other guests' data!  (Think of a multi-tenant host with guests
 belonging to different users.)

 True. But people don't mind having instruction decoding and half of
 virtio in the kernel these days, so it can't be that security critical...

 Julian




Re: [Qemu-devel] virtio-scsi WRITE_VERIFY crash

2013-04-08 Thread ronnie sahlberg
I don't think QEMU scsi emulation supports WRITE_VERIFY.

In the past there were a few instances where the code in the SCSI emulation
that determines the transfer direction, based on the opcode, did not contain
a new opcode, so it got the xfer direction wrong and crashed.

I don't have access to my box with QEMU right now,
but I would check if it is something similar to this patch:

http://lists.gnu.org/archive/html/qemu-devel/2012-04/msg04130.html



regards
ronnie sahlberg


On Mon, Apr 8, 2013 at 8:53 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Apr 05, 2013 at 11:30:00AM -0700, Venkatesh Srinivas wrote:
 When a Linux guest does a simple 'sg_verify /dev/<scsi disk>' on a
 virtio-scsi HBA, qemu (-master from git) crashes, tripping an
 assertion in scsi-disk.c:scsi_dma_complete(), that the completing DMA
 command has no IOCB.

 The callpath is:
 scsi_dma_complete
 dma_complete
 dma_bdrv_cb
 dma_bdrv_io
 dma_bdrv_read
 scsi_do_read
 bdrv_co_em_bh
 aio_bh_poll
 aio_poll.

 At the assertion, we have a zero-element iovector and the request has
 a status of -1.

 CCing Paolo Bonzini and Asias He.  See the ./MAINTAINERS file to find
 people that can help with specific QEMU subsystems.

 It would be nice to include a full gdb backtrace when possible since
 that may include extra information like that value of arguments in the
 call stack.

 Stefan




Re: [Qemu-devel] [RFC] find_next_bit optimizations

2013-03-11 Thread ronnie sahlberg
Even more efficient might be to do bitwise instead of logical or

if (tmp | d1 | d2 | d3) {

that should remove 3 of the 4 conditional jumps
and should become 3 bitwise ors and one conditional jump
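
As a quick illustration (a hypothetical standalone helper, not part of the patch):

/* The logical-OR form (d0 || d1 || d2 || d3) must short-circuit, so it
 * can compile to up to four conditional branches.  Folding the four
 * words together with bitwise OR first needs only a single branch. */
static inline int any_bit_set4(const unsigned long *p)
{
    return (p[0] | p[1] | p[2] | p[3]) != 0;
}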


On Mon, Mar 11, 2013 at 8:58 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 11/03/2013 16:37, Peter Lieven ha scritto:

 Am 11.03.2013 um 16:29 schrieb Paolo Bonzini pbonz...@redhat.com:

 Il 11/03/2013 16:24, Peter Lieven ha scritto:

 How would that be different in your patch?  But you can solve it by
 making two = loops, one checking for 4*BITS_PER_LONG and one checking
 BITS_PER_LONG.

 This is what I have now:

 diff --git a/util/bitops.c b/util/bitops.c
 index e72237a..b0dc93f 100644
 --- a/util/bitops.c
 +++ b/util/bitops.c
 @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
     const unsigned long *p = addr + BITOP_WORD(offset);
     unsigned long result = offset & ~(BITS_PER_LONG-1);
     unsigned long tmp;
 +    unsigned long d0,d1,d2,d3;
 
     if (offset >= size) {
         return size;
     }
     size -= result;
 -    offset %= BITS_PER_LONG;
 +    offset &= (BITS_PER_LONG-1);
     if (offset) {
         tmp = *(p++);
         tmp &= (~0UL << offset);
 @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         size -= BITS_PER_LONG;
         result += BITS_PER_LONG;
     }
 -    while (size & ~(BITS_PER_LONG-1)) {
 +    while (size >= 4*BITS_PER_LONG) {
 +        d0 = *p;
 +        d1 = *(p+1);
 +        d2 = *(p+2);
 +        d3 = *(p+3);
 +        if (d0 || d1 || d2 || d3) {
 +            break;
 +        }
 +        p += 4;
 +        result += 4*BITS_PER_LONG;
 +        size -= 4*BITS_PER_LONG;
 +    }
 +    while (size >= BITS_PER_LONG) {
         if ((tmp = *(p++))) {
             goto found_middle;
         }


 Minus the %= vs. &=,

 Reviewed-by: Paolo Bonzini pbonz...@redhat.com

 Perhaps:

tmp = *p;
d1 = *(p+1);
d2 = *(p+2);
d3 = *(p+3);
if (tmp) {
goto found_middle;
}
if (d1 || d2 || d3) {
break;
}

 i do not know what gcc internally makes of the d0 || d1 || d2 || d3 ?

 It depends on the target and how expensive branches are.

 i would guess its sth like one addition w/ carry and 1 test?

 It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
 followed by a compare-and-jump.

 That is, either:

  test %r8, %r8
  jz   second_loop
  test %r9, %r9
  jz   second_loop
  test %r10, %r10
  jz   second_loop
  test %r11, %r11
  jz   second_loop

 or

  or %r9, %r8
  or %r11, %r10
  or %r8, %r10
  jz   second_loop

 Don't let the length of the code fool you.  The processor knows how to
 optimize all of these, and GCC knows too.

 your proposed change would introduce 2 tests (maybe)?

 Yes, but I expect they to be fairly well predicted.

 what about this to be sure?

tmp = *p;
d1 = *(p+1);
d2 = *(p+2);
d3 = *(p+3);
if (tmp || d1 || d2 || d3) {
if (tmp) {
goto found_middle;

 I suspect that GCC would rewrite it to my version (definitely if it
 produces 4 compare-and-jumps; but possibly it does it even if it goes
 for bitwise ORs, I haven't checked).

 Regarding your other question (one last thought: would it make sense to
 update only `size` in the while loops and compute the `result` at the end
 as `orgsize` - `size`?), again the compiler knows better and might even
 do this for you.  It will likely drop the p increases and use p[result],
 so if you do that change you may even get the same code, only this time
 p is increased and you get an extra subtraction at the end. :)

 Bottom line: don't try to outsmart an optimizing C compiler on
 micro-optimization, unless you have benchmarked it and it shows there is
 a problem.

 Paolo

}
break;
}

 Peter






Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread ronnie sahlberg
Do you really need to resolve the conflicts?
It might be easier and sufficient to just flag those hashes where a
conflict has been detected as: don't dedup this hash anymore,
collisions have been seen.
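
A rough sketch of what such a flag could look like, assuming a per-cluster
hash-table entry (the names are hypothetical, not from the patch series):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical dedup table entry: once a collision has been seen for
 * this hash, the entry is poisoned and further blocks with this hash
 * are simply stored as normal, non-deduplicated clusters. */
struct dedup_entry {
    uint8_t  hash[32];          /* e.g. SHA-256 of the cluster data */
    uint64_t cluster_offset;    /* where the canonical copy lives */
    bool     collision_seen;    /* "don't dedup this hash anymore" */
};

static bool can_dedup(const struct dedup_entry *e)
{
    return e != NULL && !e->collision_seen;
}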


On Wed, Jan 2, 2013 at 10:40 AM, Benoît Canet benoit.ca...@irqsave.net wrote:
 Le Wednesday 02 Jan 2013 à 12:26:37 (-0600), Troy Benjegerdes a écrit :
 The probability may be 'low' but it is not zero. Just because it's
 hard to calculate the hash doesn't mean you can't do it. If your
 input data is not random the probability of a hash collision is
 going to get skewed.

 Read about how Bitcoin uses hashes.

 I need a budget of around $10,000 or so for some FPGAs and/or GPU cards,
 and I can make a regression test that will create deduplication hash
 collisions on purpose.

 It's not a problem: as Eric pointed out while reviewing the previous
 patchset, there is a small place left with zeroes in the deduplication
 block. A bit could be set on it when a collision is detected, and an
 offset could point to a cluster used to resolve collisions.



 On Wed, Jan 02, 2013 at 06:33:24PM +0100, Benoît Canet wrote:
   How does this code handle hash collisions, and do you have some 
   regression
   tests that purposefully create a dedup hash collision, and verify that 
   the
   'right thing' happens?
 
  The two hash functions that can be used are cryptographic and not broken
  yet.
  So nobody knows how to generate a collision.
 
  You can do the math to calculate the probability of collision using a 256 bit
  hash while processing 1EiB of data; the result is so low you can consider it
  won't happen.
  The sha256 ZFS deduplication works the same way regarding collisions.
  The sha256 ZFS deduplication works the same way regarding collisions.
 
  I currently use qemu-io-test for testing purpose and iozone with the -w 
  flag in
  the guest.
  I would like to find a good deduplication stress test to run in a guest.
 
  Regards
 
  Benoît
 
   It's great that this almost works, but it seems rather dangerous to put
   something like this into the mainline code without some regression tests.
  
   (I'm also suspecting the regression test will be a great way to find
   flakey hardware)
  
   --
   Troy Benjegerdes'da hozer' 
   ho...@hozed.org
  
   Somone asked my why I work on this free (http://www.fsf.org/philosophy/)
   software  hardware (http://q3u.be) stuff and not get a real job.
   Charles Shultz had the best answer:
  
   Why do musicians compose symphonies and poets write poems? They do it
   because life wouldn't have any meaning for them if they didn't. That's 
   why
   I draw cartoons. It's my life. -- Charles Shultz

 --
 --
 Troy Benjegerdes'da hozer' ho...@hozed.org

 Somone asked my why I work on this free (http://www.fsf.org/philosophy/)
 software  hardware (http://q3u.be) stuff and not get a real job.
 Charles Shultz had the best answer:

 Why do musicians compose symphonies and poets write poems? They do it
 because life wouldn't have any meaning for them if they didn't. That's why
 I draw cartoons. It's my life. -- Charles Shultz





Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread ronnie sahlberg
On Wed, Jan 2, 2013 at 11:18 AM, Troy Benjegerdes ho...@hozed.org wrote:
 If you do get a hash collision, it's a rather exceptional event, so I'd
 say every effort should be made to log the event and the data that created
 it in multiple places.

 There are three questions I'd ask on a hash collision:

 1) was it the data?
 2) was it the hardware?
 3) was it a software bug?

Yes, that is probably good too, and saving off the old and new block
content that collided.

Unless you are checksumming the blocks, I suspect that the most common
reason for collisions would just be cases where the original block
was corrupted/changed on disk and you don't detect it, and then when you
re-write an identical one the blocks no longer match and thus you get
a false collision.
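
In other words, if the implementation compares blocks byte-for-byte on every
hash match, a mismatch may mean silent corruption of the stored copy rather
than a genuine hash collision. A minimal sketch of that check (hypothetical,
not from the patch series):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* On a hash match, compare the incoming block against the stored one.
 * A mismatch is either a true hash collision or corruption of the
 * stored copy; without a per-block checksum the two cases cannot be
 * told apart, so both blocks should be saved off for inspection. */
static int dedup_match_is_safe(const uint8_t *new_block,
                               const uint8_t *stored_block,
                               size_t block_size)
{
    if (memcmp(new_block, stored_block, block_size) == 0) {
        return 1;   /* identical contents: safe to deduplicate */
    }
    /* log/save both blocks here, then refuse to dedup this hash */
    return 0;
}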



Re: [Qemu-devel] [PATCH][RESEND] iscsi: add support for iSCSI NOPs

2012-12-03 Thread ronnie sahlberg
Acked-By: ronniesahlb...@gmail.com (Ronnie Sahlberg)


This verifies that the service is actually operational and is much
more reliable than TCP keepalives.
This is the proper way to monitor that the iscsi target is alive.

As a later patch, we should add the ability to configure this via the
qemu config file instead of using hardcoded values.
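
A sketch of how that could look, reusing the -iscsi option group that
parse_chap() already reads from (the option names and helper are hypothetical):

/* Hypothetical: fetch the NOP interval and failure limit from the
 * -iscsi option group, falling back to the hardcoded defaults. */
static void iscsi_get_nop_params(const char *target,
                                 uint64_t *interval_ms, uint64_t *max_fail)
{
    QemuOpts *opts = qemu_opts_find(qemu_find_opts("iscsi"), target);

    *interval_ms = NOP_INTERVAL;
    *max_fail = MAX_NOP_FAILURES;
    if (opts) {
        *interval_ms = qemu_opt_get_number(opts, "nop-interval", NOP_INTERVAL);
        *max_fail = qemu_opt_get_number(opts, "max-nop-failures",
                                        MAX_NOP_FAILURES);
    }
}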


regards
ronnie sahlberg


On Mon, Dec 3, 2012 at 11:34 AM, Peter Lieven p...@dlhnet.de wrote:
 This patch will send NOP-Out PDUs every 5 seconds to the iSCSI target.
 If a consecutive number of NOP-In replies fail a reconnect is initiated.
 iSCSI NOPs help to ensure that the connection to the target is still 
 operational.
 This should not, but in reality may be the case even if the TCP connection is 
 still
 alive if there are bugs in either the target or the initiator implementation.

 Reported-by: Ronnie Sahlberg ronniesahlb...@gmail.com
 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c |   43 +++++++++++++++++++++++++++++++++++++++++++++
  1 file changed, 43 insertions(+)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index d0b1a10..fab4c8b 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -47,6 +47,9 @@ typedef struct IscsiLun {
  int block_size;
  uint64_t num_blocks;
  int events;
 +
 +QEMUTimer *nop_timer;
 +int nops_in_flight;
  } IscsiLun;

  typedef struct IscsiAIOCB {
 @@ -72,6 +75,9 @@ struct IscsiTask {
  int complete;
  };

 +#define NOP_INTERVAL 5000
 +#define MAX_NOP_FAILURES 3
 +
  static void
  iscsi_bh_cb(void *p)
  {
 @@ -925,6 +931,35 @@ static char *parse_initiator_name(const char *target)
  }
  }

 +static void iscsi_nop_cb(struct iscsi_context *iscsi, int status, void *command_data, void *private_data)
 +{
 +IscsiLun *iscsilun = private_data;
 +
 +if (iscsilun) {
 +iscsilun->nops_in_flight = 0;
 +}
 +}
 +
 +static void iscsi_nop_timed_event(void *opaque)
 +{
 +IscsiLun *iscsilun = opaque;
 +
 +if (iscsilun->nops_in_flight >= MAX_NOP_FAILURES) {
 +error_report("iSCSI: NOP timeout. Reconnecting...");
 +iscsi_reconnect(iscsilun->iscsi);
 +iscsilun->nops_in_flight = 0;
 +}
 +
 +if (iscsi_nop_out_async(iscsilun->iscsi, iscsi_nop_cb, NULL, 0, iscsilun) != 0) {
 +error_report("iSCSI: failed to sent NOP-Out. Disabling NOP messages.");
 +return;
 +}
 +
 +qemu_mod_timer(iscsilun->nop_timer, qemu_get_clock_ms(rt_clock) + NOP_INTERVAL);
 +iscsi_set_events(iscsilun);
 +iscsilun->nops_in_flight++;
 +}
 +
  /*
   * We support iscsi url's on the form
   * iscsi://[username%password@]host[:port]/targetname/lun
 @@ -1036,6 +1071,10 @@ static int iscsi_open(BlockDriverState *bs, const char *filename, int flags)

  ret = 0;

 +/* Set up a timer for sending out iSCSI NOPs */
 +iscsilun->nop_timer = qemu_new_timer_ms(rt_clock, iscsi_nop_timed_event, iscsilun);
 +qemu_mod_timer(iscsilun->nop_timer, qemu_get_clock_ms(rt_clock) + NOP_INTERVAL);
 +
  out:
  if (initiator_name != NULL) {
  g_free(initiator_name);
 @@ -1058,6 +1097,10 @@ static void iscsi_close(BlockDriverState *bs)
  IscsiLun *iscsilun = bs->opaque;
  struct iscsi_context *iscsi = iscsilun->iscsi;
 
 +if (iscsilun->nop_timer) {
 +qemu_del_timer(iscsilun->nop_timer);
 +qemu_free_timer(iscsilun->nop_timer);
 +}
  qemu_aio_set_fd_handler(iscsi_get_fd(iscsi), NULL, NULL, NULL, NULL);
  iscsi_destroy_context(iscsi);
  memset(iscsilun, 0, sizeof(IscsiLun));
 --
 1.7.9.5





Re: [Qemu-devel] [PATCH] iscsi: partly avoid iovec linearization in iscsi_aio_writev

2012-11-19 Thread ronnie sahlberg
On Mon, Nov 19, 2012 at 7:18 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 19/11/2012 15:58, Peter Lieven ha scritto:

 -/* XXX we should pass the iovec to write16 to avoid the extra copy */
 -/* this will allow us to get rid of 'buf' completely */
  size = nb_sectors * BDRV_SECTOR_SIZE;
 -acb->buf = g_malloc(size);
 -qemu_iovec_to_buf(acb->qiov, 0, acb->buf, size);
 +data.size = size;
 +
 +/* if the iovec only contains one buffer we can pass it directly */
 +if (acb->qiov->niov == 1) {
 +acb->buf = NULL;
 +data.data = acb->qiov->iov[0].iov_base;
 +} else {
 +acb->buf = g_malloc(size);
 +qemu_iovec_to_buf(acb->qiov, 0, acb->buf, size);
 +data.data = acb->buf;
 +}

 Looks good, but how hard is it to get rid of the bounce buffer
 completely, as mentioned in the comment?  Ronnie?


I am working on that, but I will need the thanksgiving weekend to
finish it and test it.
It also means breaking the API, since it would introduce new
functionality, so it will take a little while after finishing the
libiscsi part before we could/should introduce it in qemu.

The vast majority of writes seems to be a vector with only a single
element, so Peter's change is a good optimization for the common case,
until I get the proper fix finished and tested and pushed in a new
release of libiscsi.

I.e. I think Peter's change is good for now. It solves the problem
with the extra memcpy for the majority of writes until we can get the
full solution ready.


 Paolo



Re: [Qemu-devel] [PATCH] iscsi: fix segfault in url parsing

2012-11-15 Thread ronnie sahlberg
Acked-By: ronniesahlb...@gmail.com

On Thu, Nov 15, 2012 at 6:42 AM, Peter Lieven p...@dlhnet.de wrote:
 If an invalid URL is specified iscsi_get_error(iscsi) is called
 with iscsi == NULL.

 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c |3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index d0b1a10..b5c3161 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -947,8 +947,7 @@ static int iscsi_open(BlockDriverState *bs, const char *filename, int flags)

  iscsi_url = iscsi_parse_full_url(iscsi, filename);
  if (iscsi_url == NULL) {
 -error_report("Failed to parse URL : %s %s", filename,
 - iscsi_get_error(iscsi));
 +error_report("Failed to parse URL : %s", filename);
  ret = -EINVAL;
  goto out;
  }
 --
 1.7.9.5




Re: [Qemu-devel] [PATCH] iscsi: fix deadlock during login

2012-11-15 Thread ronnie sahlberg
I don't know if we should switch to use synchronous code here.
It is much nicer if all code is async.

Is it possible to add a timeout instead that would break out if the
connect/login has not completed within a certain amount of time?
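
Something along these lines, as a rough sketch only (it reuses the
IscsiTask/qemu_aio_wait() loop from the patch below together with the QEMU
timer API; CONNECT_TIMEOUT_MS and the helper names are hypothetical):

#define CONNECT_TIMEOUT_MS 10000   /* hypothetical value */

static void iscsi_connect_timed_out(void *opaque)
{
    struct IscsiTask *itask = opaque;

    /* Mark the login as failed so the wait loop below terminates. */
    itask->status   = 1;
    itask->complete = 1;
}

static void iscsi_wait_for_login(IscsiLun *iscsilun, struct IscsiTask *itask)
{
    QEMUTimer *t = qemu_new_timer_ms(rt_clock, iscsi_connect_timed_out, itask);

    qemu_mod_timer(t, qemu_get_clock_ms(rt_clock) + CONNECT_TIMEOUT_MS);
    while (!itask->complete) {
        iscsi_set_events(iscsilun);
        qemu_aio_wait();
    }
    qemu_del_timer(t);
    qemu_free_timer(t);
}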


regards
ronnie sahlberg

On Thu, Nov 15, 2012 at 6:50 AM, Peter Lieven p...@dlhnet.de wrote:
 If the connection is interrupted before the first login is successfully
 completed qemu-kvm is waiting forever in qemu_aio_wait().

 This is fixed by performing a sync login to the target. If the
 connection breaks after the first successful login, errors are
 handled internally by libiscsi.

 Signed-off-by: Peter Lieven p...@kamp.de
 ---
  block/iscsi.c |   56 +++++++++++++++++++++-----------------------------------
  1 file changed, 21 insertions(+), 35 deletions(-)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index b5c3161..f44bb57 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -798,30 +798,6 @@ iscsi_inquiry_cb(struct iscsi_context *iscsi, int status, void *command_data,
      }
  }
 
 -static void
 -iscsi_connect_cb(struct iscsi_context *iscsi, int status, void *command_data,
 -                 void *opaque)
 -{
 -    struct IscsiTask *itask = opaque;
 -    struct scsi_task *task;
 -
 -    if (status != 0) {
 -        itask->status   = 1;
 -        itask->complete = 1;
 -        return;
 -    }
 -
 -    task = iscsi_inquiry_task(iscsi, itask->iscsilun->lun,
 -                              0, 0, 36,
 -                              iscsi_inquiry_cb, opaque);
 -    if (task == NULL) {
 -        error_report("iSCSI: failed to send inquiry command.");
 -        itask->status   = 1;
 -        itask->complete = 1;
 -        return;
 -    }
 -}
 -
  static int parse_chap(struct iscsi_context *iscsi, const char *target)
  {
      QemuOptsList *list;
 @@ -934,7 +910,8 @@ static int iscsi_open(BlockDriverState *bs, const char *filename, int flags)
      IscsiLun *iscsilun = bs->opaque;
      struct iscsi_context *iscsi = NULL;
      struct iscsi_url *iscsi_url = NULL;
 -    struct IscsiTask task;
 +    struct IscsiTask itask;
 +    struct scsi_task *task;
      char *initiator_name = NULL;
      int ret;
 
 @@ -997,27 +974,36 @@ static int iscsi_open(BlockDriverState *bs, const char *filename, int flags)
      /* check if we got HEADER_DIGEST via the options */
      parse_header_digest(iscsi, iscsi_url->target);
 
 -    task.iscsilun = iscsilun;
 -    task.status = 0;
 -    task.complete = 0;
 -    task.bs = bs;
 +    if (iscsi_full_connect_sync(iscsi, iscsi_url->portal, iscsi_url->lun) != 0) {
 +        error_report("iSCSI: Failed to connect to LUN : %s",
 +                     iscsi_get_error(iscsi));
 +        ret = -EINVAL;
 +        goto out;
 +    }
 +
 +    itask.iscsilun = iscsilun;
 +    itask.status = 0;
 +    itask.complete = 0;
 +    itask.bs = bs;
 
      iscsilun->iscsi = iscsi;
      iscsilun->lun   = iscsi_url->lun;
 
 -    if (iscsi_full_connect_async(iscsi, iscsi_url->portal, iscsi_url->lun,
 -                                 iscsi_connect_cb, &task)
 -        != 0) {
 -        error_report("iSCSI: Failed to start async connect.");
 +    task = iscsi_inquiry_task(iscsi, iscsilun->lun,
 +                              0, 0, 36,
 +                              iscsi_inquiry_cb, &itask);
 +    if (task == NULL) {
 +        error_report("iSCSI: failed to send inquiry command.");
          ret = -EINVAL;
          goto out;
      }
 
 -    while (!task.complete) {
 +    while (!itask.complete) {
          iscsi_set_events(iscsilun);
          qemu_aio_wait();
      }
 -    if (task.status != 0) {
 +
 +    if (itask.status != 0) {
          error_report("iSCSI: Failed to connect to LUN : %s",
                       iscsi_get_error(iscsi));
          ret = -EINVAL;
 --
 1.7.9.5



Re: [Qemu-devel] [PATCH] iscsi: fix deadlock during login

2012-11-15 Thread ronnie sahlberg
On Thu, Nov 15, 2012 at 7:54 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 15/11/2012 15:57, ronnie sahlberg ha scritto:
 I don't know if we should switch to use synchronous code here.
 It is much nicer if all code is async.

 bdrv_open is generally synchronous, so I think Peter's patch is ok.

I was thinking about the case where you disconnect/reconnect a device
at runtime. Like swapping the medium in a CDROM.
If bdrv_open() is synchronous and blocks for a long time, would that
not impact the rest of QEMU?

Otherwise:
Acked-by:  ronniesahlb...@gmail.com


 Paolo

 Is it possible to add a timeout instead that would break out if the
 connect/login has not completed within a certain amount of time?




Re: [Qemu-devel] Ubuntu/Debian Installer + Virtio-SCSI - Bad ram pointer

2012-10-31 Thread ronnie sahlberg
On Tue, Oct 30, 2012 at 10:48 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Tue, Oct 30, 2012 at 10:09 PM, ronnie sahlberg
 ronniesahlb...@gmail.com wrote:
 About half a year ago there was an issue where recent kernels had added
 support to start using new scsi opcodes,  but the qemu functions that
 determine which transfer direction is used for this opcode had not
 yet been updated, so that the opcode was sent with the wrong transfer
 direction.

 That caused the guests memory to be overwritten and crash.

 I don't have (easy) access to the git tree right now, but it was a
 patch for the ATA_PASSTHROUGH command that fixed that.

 This patch?

 http://patchwork.ozlabs.org/patch/174946/

 Stefan

This is the one I was thinking about :
381b634c275ca1a2806e97392527bbfc01bcb333

But that also crashed when using local /dev/sg* devices.



Re: [Qemu-devel] Ubuntu/Debian Installer + Virtio-SCSI - Bad ram pointer

2012-10-30 Thread ronnie sahlberg
But older distros/kernels work fine?

Can you take a network trace?

About half a year ago there was an issue where recent kernels had added
support to start using new scsi opcodes,  but the qemu functions that
determine which transfer direction is used for this opcode had not
yet been updated, so that the opcode was sent with the wrong transfer
direction.

That caused the guests memory to be overwritten and crash.

I don't have (easy) access to the git tree right now, but it was a
patch for the ATA_PASSTHROUGH command that fixed that.

Could you take a network trace and check whether this is something
similar? I.e. does the guest send an unusual scsi opcode just
before the crash?


regards
ronnie sahlberg


On Tue, Oct 30, 2012 at 12:37 PM, Peter Lieven p...@dlhnet.de wrote:
 Am 30.10.2012 19:27, schrieb Stefan Hajnoczi:

 On Tue, Oct 30, 2012 at 4:56 PM, Peter Lieven p...@dlhnet.de wrote:

 On 30.10.2012 09:32, Stefan Hajnoczi wrote:

 On Mon, Oct 29, 2012 at 03:09:37PM +0100, Peter Lieven wrote:

 Hi,

 Bug subject should be virtio-blk, not virtio-scsi.  virtio-scsi is a
 different virtio device type from virtio-blk and is not present in the
 backtrace you posted.

 Sounds pedantic but I want to make sure this gets chalked up against the
 right device :).

 If I try to Install Ubuntu 12.04 LTS / 12.10 64-bit on a virtio
 storage backend that supports iSCSI
 qemu-kvm crashes reliably with the following error:

 Are you using vanilla qemu-kvm-1.2.0 or are there patches applied?

 Have you tried qemu-kvm.git/master?

 Have you tried a local raw disk image to check whether libiscsi is
 involved?

 Bad ram pointer 0x3039303620008000

 This happens directly after the confirmation of the Timezone before
 the Disk is partitioned.

 If I specify  -global virtio-blk-pci.scsi=off in the cmdline this
 does not happen.

 Here is a stack trace:

 Thread 1 (Thread 0x77fee700 (LWP 8226)):
 #0 0x763c0a10 in abort () from /lib/x86_64-linux-gnu/libc.so.6
 No symbol table info available.
 #1 0x557b751d in qemu_ram_addr_from_host_nofail (
 ptr=0x3039303620008000) at /usr/src/qemu-kvm-1.2.0/exec.c:2835
 ram_addr = 0
 #2 0x557b9177 in cpu_physical_memory_unmap (
 buffer=0x3039303620008000, len=4986663671065686081, is_write=1,
 access_len=1) at /usr/src/qemu-kvm-1.2.0/exec.c:3645

 buffer and len are ASCII junk.  It appears to be hex digits and it's not
 clear where they come from.

 It would be interesting to print *elem one stack frame up in #3
 virtqueue_fill() to show the iovecs and in/out counts.


 (gdb) print *elem

 Great, thanks for providing this info:

 $6 = {index = 3, out_num = 2, in_num = 4, in_addr = {1914920960, 1916656688,
  2024130072, 2024130088, 0 <repeats 508 times>, 4129, 93825009696000,
  140737328183160, 0 <repeats 509 times>}, out_addr = {2024130056,
  2038414056, 0, 8256, 4128, 93824999311936, 0, 3, 0 <repeats 512 times>,
  12385, 93825009696000, 140737328183160, 0 <repeats 501 times>},

 Up to here everything is fine.

 in_sg =
 {{
iov_base = 0x3039303620008000, iov_len = 4986663671065686081}, {
iov_base = 0x383038454635, iov_len = 3544389261899019573}, {

 The fields are bogus, in_sg has been overwritten with ASCII data.
 Unfortunately I don't see any hint of where this ASCII data came from
 yet.

 The hdr fields you provided in stack frame #6 show that in_sg was
 overwritten during or after the bdrv_ioctl() call.  We pulled valid
 data out of the vring and mapped buffers correctly.  But something is
 overwriting in_sg and when we complete the request we blow up due to
 the bogus values.

 Ok, one thing I have to mention: I've been testing with qemu-kvm 1.2.0
 and libiscsi for a few weeks now. It's been very stable. The only time
 it blows up is during the debian/ubuntu installer. Ubuntu itself, for
 instance, is running flawlessly. My guess is that the installer is probing
 for something. The installer itself also runs flawlessly when I disable
 scsi passthru with scsi=off.


 Please post your full qemu-kvm command-line.

 /usr/bin/qemu-kvm-1.2.0  -net
 tap,vlan=164,script=no,downscript=no,ifname=tap0  -net
 nic,vlan=164,model=e1000,macaddr=52:54:00:ff:01:35   -iscsi
 initiator-name=iqn.2005-03.org.virtual-core:0025b51f001c  -drive
 format=iscsi,file=iscsi://172.21.200.56/iqn.2001-05.com.equallogic:0-8a0906-335f4e007-d29001a3355508e8-libiscsi-test-hd0/0,if=virtio,cache=none,aio=native
 -m 2048 -smp 2,sockets=1,cores=2,threads=1  -monitor
 tcp:0:4002,server,nowait -vnc :2 -qmp tcp:0:3002,server,nowait -name
 'libiscsi-debug'  -boot order=dc,menu=off  -k de  -pidfile
 /var/run/qemu/vm-280.pid  -mem-path /hugepages  -mem-prealloc  -cpu
 host,+x2apic,model_id='Intel(R) Xeon(R) CPU   L5640  @ 2.27GHz',-tsc
 -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus


 Please also post the exact qemu-kvm version you are using.  I can see
 it's based on qemu

Re: [Qemu-devel] qemu-img convert with block driver without .bdrv_create (like iscsi)

2012-10-25 Thread ronnie sahlberg
On Thu, Oct 25, 2012 at 1:02 AM, Kevin Wolf kw...@redhat.com wrote:
 Am 25.10.2012 09:52, schrieb Paolo Bonzini:
 Il 25/10/2012 09:46, Kevin Wolf ha scritto:
 1)add a .bdrv_create in block/iscsi.c ?

 (like host_device block driver, only open/close the device and check if 
 size if big enough)
 Yes, this is the right way.

 Could it be a default implementation of .bdrv_create (i.e. something
 you'd do in bdrv_create if the protocol doesn't have it)?

 No, there are block drivers that really can't create images. They should
 keep failing.

Technically, you can not create new LUNs via the iscsi protocol
itself, you can only access pre-existing luns
that have been created by some out-of-band method.

So basically, with iscsi you can only ever access already preexisting luns.

In that case I think requiring a .bdrv_create feels wrong. We are not
creating a new LUN here but just opening an already existing LUN.


What about changing bdrv_create() so that IF there is no .bdrv_create
method, then assume it might be a pre-existing file in which case we
fallback to use .bdrv_open instead.
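
Roughly like this, as a sketch only against the block-layer API of that time
(not a tested patch):

/* Sketch: in bdrv_create(), if the driver cannot create images, treat
 * "create" as "verify that the pre-existing target can be opened". */
if (!drv->bdrv_create) {
    BlockDriverState *bs = bdrv_new("");
    int ret = bdrv_open(bs, filename, BDRV_O_RDWR, drv);

    bdrv_delete(bs);
    return ret;   /* 0 means the pre-existing LUN/file is usable */
}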

regards
ronnie sahlberg



Re: [Qemu-devel] qemu-img convert with block driver without .bdrv_create (like iscsi)

2012-10-25 Thread ronnie sahlberg
On Thu, Oct 25, 2012 at 6:58 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 25/10/2012 15:41, ronnie sahlberg ha scritto:
 On Thu, Oct 25, 2012 at 1:02 AM, Kevin Wolf kw...@redhat.com wrote:
 Am 25.10.2012 09:52, schrieb Paolo Bonzini:
 Il 25/10/2012 09:46, Kevin Wolf ha scritto:
 1)add a .bdrv_create in block/iscsi.c ?

 (like host_device block driver, only open/close the device and check if 
 size if big enough)
 Yes, this is the right way.

 Could it be a default implementation of .bdrv_create (i.e. something
 you'd do in bdrv_create if the protocol doesn't have it)?

 No, there are block drivers that really can't create images. They should
 keep failing.

 Technically, you can not create new LUNs via the iscsi protocol
 itself, you can only access pre-existing luns
 that have been created by some out-of-band method.

 So basically, with iscsi you can only ever access already preexisting luns.

 In that case I think requiring a .bdrv_create feels wrong. We are not
 creating a new LUN here but just opening an already existing LUN.

 The problem is that bdrv_create is overloaded to mean both create the
 backing storage and format the image.  Only the latter applies to
 iSCSI and, in general, as far as protocols are concerned bdrv_create is
 usually a no-op.

 However, Kevin said he prefers to have an explicit override of
 bdrv_create for iSCSI.  Can you implement that?  (I'll then do the same
 for NBD).


Yepp. Since there is consensus,
we can do a .bdrv_create for iscsi.c.

regards
ronnie sahlberg



[Qemu-devel] [PATCH] SCSI disk, set HISUP bit in INQ data

2012-09-14 Thread Ronnie Sahlberg
Paolo, List

Please find a trivial patch that makes SCSI-DISK set the HISUP bit in the INQ 
data.
Since I think all LUN numbers are reported as SAM4 describes, we should set 
this bit.

It's already set in SCSI-BUS.


regards
ronnie sahlberg




[Qemu-devel] [PATCH] SCSI: Standard INQUIRY data should report HiSup flag as set.

2012-09-14 Thread Ronnie Sahlberg
QEMU as far as I know only reports LUN numbers using the modes that
are described in SAM4.
As such, since all LUN numbers generated by the SCSI emulation in QEMU
follow SAM4, we should set the HiSup bit in the standard INQUIRY data
to indicate such.

From SAM4:
  4.6.3 LUNs overview
  All LUN formats described in this standard are hierarchical in
  structure even when only a single level in that hierarchy is used.
  The HISUP bit shall be set to one in the standard INQUIRY data
  (see SPC-4) when any LUN format described in this standard is used.
  Non-hierarchical formats are outside the scope of this standard.

Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
---
 hw/scsi-disk.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index 1585683..52bc062 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -678,7 +678,7 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)
  * is actually implemented, but we're good enough.
  */
 outbuf[2] = 5;
-outbuf[3] = 2; /* Format 2 */
+outbuf[3] = 2 | 0x10; /* Format 2, HiSup */
 
 if (buflen  36) {
 outbuf[4] = buflen - 5; /* Additional Length = (Len - 1) - 4 */
-- 
1.7.3.1




Re: [Qemu-devel] [PATCH] iSCSI: We dont need to explicitely call qemu_notify_event() any more

2012-09-07 Thread ronnie sahlberg
Ping?

On Thu, Aug 30, 2012 at 4:56 PM, Ronnie Sahlberg
ronniesahlb...@gmail.com wrote:
 We no longer need to explicitly call qemu_notify_event() any more since this 
 is now done automatically any time the filehandles we listen to change.

 Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
 ---
  block/iscsi.c |    6 ------
  1 files changed, 0 insertions(+), 6 deletions(-)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index 0b96165..355ce65 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -167,12 +167,6 @@ iscsi_set_events(IscsiLun *iscsilun)

  }

 -/* If we just added an event, the callback might be delayed
 - * unless we call qemu_notify_event().
 - */
 -if (ev & ~iscsilun->events) {
 -qemu_notify_event();
 -}
  iscsilun->events = ev;
  }

 --
 1.7.3.1




Re: [Qemu-devel] [PATCH] iSCSI: We need to support SG_IO also from iscsi_ioctl()

2012-09-07 Thread ronnie sahlberg
ping?

On Thu, Aug 30, 2012 at 5:28 PM, Ronnie Sahlberg
ronniesahlb...@gmail.com wrote:
 We need to support SG_IO from the synchronous iscsi_ioctl() since
 scsi-block uses this to do an INQ to the device to discover its properties
 This patch makes scsi-block work with iscsi.

 Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
 ---
  block/iscsi.c |   20 +++++++++++++++++++-
  1 files changed, 19 insertions(+), 1 deletions(-)

 diff --git a/block/iscsi.c b/block/iscsi.c
 index 355ce65..189ab6f 100644
 --- a/block/iscsi.c
 +++ b/block/iscsi.c
 @@ -537,7 +537,8 @@ iscsi_aio_ioctl_cb(struct iscsi_context *iscsi, int status,
 
  #define SG_ERR_DRIVER_SENSE    0x08
 
 -if (status == SCSI_STATUS_CHECK_CONDITION && acb->task->datain.size >= 2) {
 +if (status == SCSI_STATUS_CHECK_CONDITION
 + && acb->task->datain.size >= 2) {
  int ss;

  acb-ioh-driver_status |= SG_ERR_DRIVER_SENSE;
 @@ -622,9 +623,17 @@ static BlockDriverAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
  return &acb->common;
  }

 +
 +static void ioctl_cb(void *opaque, int status)
 +{
 +int *p_status = opaque;
 +*p_status = status;
 +}
 +
  static int iscsi_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
  {
  IscsiLun *iscsilun = bs->opaque;
 +int status;
 
  switch (req) {
  case SG_GET_VERSION_NUM:
 @@ -633,6 +642,15 @@ static int iscsi_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
  case SG_GET_SCSI_ID:
  ((struct sg_scsi_id *)buf)->scsi_type = iscsilun->type;
  break;
 +case SG_IO:
 +status = -EINPROGRESS;
 +iscsi_aio_ioctl(bs, req, buf, ioctl_cb, &status);
 +
 +while (status == -EINPROGRESS) {
 +qemu_aio_wait();
 +}
 +
 +return 0;
  default:
  return -1;
  }
 --
 1.7.3.1




Re: [Qemu-devel] [PATCH v6 2/2] block: Support GlusterFS as a QEMU block backend

2012-09-06 Thread ronnie sahlberg
On Fri, Sep 7, 2012 at 1:47 AM, Daniel P. Berrange berra...@redhat.comwrote:

 On Thu, Sep 06, 2012 at 09:10:04PM +0530, Bharata B Rao wrote:
  On Thu, Sep 06, 2012 at 11:29:36AM +0300, Avi Kivity wrote:
   On 08/14/2012 12:58 PM, Kevin Wolf wrote:
   
While we are at this, let me bring out another issue. Gluster
 supports 3
transport types:
   
- socket in which case the server will be hostname, ipv4 or ipv4
 address.
- rdma in which case server will be interpreted similar to socket.
- unix in which case server will be a path to unix domain socket
 and this
  will look like any other filesystem path. (Eg.
 /tmp/glusterd.socket)
   
I don't think we can fit 'unix' within the standard URI scheme (RFC
 3986)
easily, but I am planning to specify the 'unix' transport as below:
   
gluster://[/path/to/unix/domain/socket]/volname/image?transport=unix
   
i,e., I am asking the user to put the unix domain socket path within
square brackets when transport type is unix.
   
Do you think this is fine ?
   
Never saw something like this before, but it does seem reasonable to
 me.
Excludes ] from the valid characters in the file name of the socket,
 but
that shouldn't be a problem in practice.
  
   Bikeshedding, but I prefer
  
 gluster:///path/to/unix/domain/socket:/volname/image?transport=unix
 
  So if the unix domain socket is /tmp/glusterd.socket, then this would
 look like:
 
  gluster:///tmp/glusterd.socket:/volname/image?transport=unix.
 
  So you are saying ':' will be the separator between the unix domain socket
  path and the rest of the URI components.
 
  Unless you or others strongly feel about this, I would like to go with
  [ ] based spec, which I feel is less prone to errors like missing a colon
  by mistake :)
 
  
   as being more similar to file://, or even
  
 gluster:///path/to/unix/domain/socket/volname/image?transport=unix
  
   with the last two components implied to be part of the payload, not the
   path.
 
  Note that image is a file path by itself like /dir1/a.img. So I guess it
  becomes difficult to figure out where the unix domain socket path ends
  and rest of the URI components begin w/o a separator in between.

 IMHO this is all gross. URIs already have a well defined way to provide
 multiple parameters, dealing with escaping of special characters. ie query
 parameters. The whole benefit of using URI syntax is to let apps process
 the URIs using standard APIs. We should not be trying to define some extra
 magic encoding to let us stuff 2 separate parameters into the path
 component
 since that means apps have to now write custom parsing code again. Either
 the UNIX socket path, or the volume path should be in the URI path, not
 both. The other part should be a URI parameter. I'd really expect us to
 use:

    gluster:///volname/image?transport=unix&sockpath=/path/to/unix/sock


+1

ronnie sahlberg


[Qemu-devel] [PATCH] iSCSI: remove the now obsolete call to qemu_notify_event()

2012-08-30 Thread Ronnie Sahlberg
Paolo, List

Please find a trivial patch to iscsi that removes the call to 
qemu_notify_event().
We do not need to call this any more since a recent patch made
qemu_aio_set_fd_handler() and friends invoke this call automatically when 
required.

regards
ronnie sahlberg




[Qemu-devel] [PATCH] iSCSI: We dont need to explicitely call qemu_notify_event() any more

2012-08-30 Thread Ronnie Sahlberg
We no longer need to explicitly call qemu_notify_event() any more since this 
is now done automatically any time the filehandles we listen to change.

Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
---
 block/iscsi.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 0b96165..355ce65 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -167,12 +167,6 @@ iscsi_set_events(IscsiLun *iscsilun)
 
 }
 
-/* If we just added an event, the callback might be delayed
- * unless we call qemu_notify_event().
- */
-if (ev & ~iscsilun->events) {
-qemu_notify_event();
-}
 iscsilun->events = ev;
 }
 
-- 
1.7.3.1




[Qemu-devel] [PATCH] iSCSI: Add support for SG_IO to iscsi_ioctl()

2012-08-30 Thread Ronnie Sahlberg
Paolo, List

Please find a small patch that adds support for SG_IO to the synchronous 
iscsi_ioctl() function.
This is needed to make scsi-block work with iscsi.

regards
ronnie sahlberg




[Qemu-devel] [PATCH] iSCSI: We need to support SG_IO also from iscsi_ioctl()

2012-08-30 Thread Ronnie Sahlberg
We need to support SG_IO from the synchronous iscsi_ioctl() since
scsi-block uses this to do an INQ to the device to discover its properties
This patch makes scsi-block work with iscsi.

Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
---
 block/iscsi.c |   20 +++++++++++++++++++-
 1 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 355ce65..189ab6f 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -537,7 +537,8 @@ iscsi_aio_ioctl_cb(struct iscsi_context *iscsi, int status,
 
 #define SG_ERR_DRIVER_SENSE    0x08
 
-if (status == SCSI_STATUS_CHECK_CONDITION && acb->task->datain.size >= 2) {
+if (status == SCSI_STATUS_CHECK_CONDITION
+ && acb->task->datain.size >= 2) {
 int ss;
 
 acb-ioh-driver_status |= SG_ERR_DRIVER_SENSE;
@@ -622,9 +623,17 @@ static BlockDriverAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
 return &acb->common;
 }
 
+
+static void ioctl_cb(void *opaque, int status)
+{
+int *p_status = opaque;
+*p_status = status;
+}
+
 static int iscsi_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
 {
 IscsiLun *iscsilun = bs->opaque;
+int status;
 
 switch (req) {
 case SG_GET_VERSION_NUM:
@@ -633,6 +642,15 @@ static int iscsi_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
 case SG_GET_SCSI_ID:
 ((struct sg_scsi_id *)buf)->scsi_type = iscsilun->type;
 break;
+case SG_IO:
+status = -EINPROGRESS;
+iscsi_aio_ioctl(bs, req, buf, ioctl_cb, &status);
+
+while (status == -EINPROGRESS) {
+qemu_aio_wait();
+}
+
+return 0;
 default:
 return -1;
 }
-- 
1.7.3.1




Re: [Qemu-devel] [PATCH v2 1/3] net: asynchronous send/receive infrastructure for net/socket.c

2012-08-21 Thread ronnie sahlberg
In
net_socket_update_fd_handler()

shouldn't you call qemu_notify_event() if any of the handlers have
changed from NULL to non-NULL?
or else it might take a while before the change takes effect.
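
For comparison, block/iscsi.c used exactly that kind of guard before it became
unnecessary. A sketch of the idea (hypothetical helper, not part of this patch):

/* Kick the main loop only on a NULL -> non-NULL transition, so a
 * newly registered handler is picked up promptly instead of waiting
 * for the next main-loop iteration. */
static void maybe_notify(IOHandler *old_handler, IOHandler *new_handler)
{
    if (old_handler == NULL && new_handler != NULL) {
        qemu_notify_event();
    }
}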


regards
ronnie sahlberg

On Wed, Aug 22, 2012 at 1:52 AM, Stefan Hajnoczi
stefa...@linux.vnet.ibm.com wrote:
 The net/socket.c net client is not truly asynchronous.  This patch
 borrows the qemu_set_fd_handler2() code from net/tap.c as the basis for
 proper asynchronous send/receive.

 Only read packets from the socket when the peer is able to receive.
 This avoids needless queuing.

 Later patches implement asynchronous send.

 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
 ---
  net/socket.c |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++------
  1 file changed, 52 insertions(+), 6 deletions(-)

 diff --git a/net/socket.c b/net/socket.c
 index c172c24..54e32f0 100644
 --- a/net/socket.c
 +++ b/net/socket.c
 @@ -42,9 +42,51 @@ typedef struct NetSocketState {
  unsigned int packet_len;
  uint8_t buf[4096];
  struct sockaddr_in dgram_dst; /* contains inet host and port destination 
 iff connectionless (SOCK_DGRAM) */
 +IOHandler *send_fn;   /* differs between SOCK_STREAM/SOCK_DGRAM 
 */
 +bool read_poll;   /* waiting to receive data? */
 +bool write_poll;  /* waiting to transmit data? */
  } NetSocketState;

  static void net_socket_accept(void *opaque);
 +static void net_socket_writable(void *opaque);
 +
 +/* Only read packets from socket when peer can receive them */
 +static int net_socket_can_send(void *opaque)
 +{
 +NetSocketState *s = opaque;
 +
 +return qemu_can_send_packet(&s->nc);
 +}
 +
 +static void net_socket_update_fd_handler(NetSocketState *s)
 +{
 +qemu_set_fd_handler2(s->fd,
 + s->read_poll  ? net_socket_can_send : NULL,
 + s->read_poll  ? s->send_fn : NULL,
 + s->write_poll ? net_socket_writable : NULL,
 + s);
 +}
 +
 +static void net_socket_read_poll(NetSocketState *s, bool enable)
 +{
 +s->read_poll = enable;
 +net_socket_update_fd_handler(s);
 +}
 +
 +static void net_socket_write_poll(NetSocketState *s, bool enable)
 +{
 +s->write_poll = enable;
 +net_socket_update_fd_handler(s);
 +}
 +
 +static void net_socket_writable(void *opaque)
 +{
 +NetSocketState *s = opaque;
 +
 +net_socket_write_poll(s, false);
 +
 +qemu_flush_queued_packets(&s->nc);
 +}

  /* XXX: we consider we can send the whole packet without blocking */
  static ssize_t net_socket_receive(NetClientState *nc, const uint8_t *buf, 
 size_t size)
 @@ -81,7 +123,8 @@ static void net_socket_send(void *opaque)
  } else if (size == 0) {
  /* end of connection */
  eoc:
 -qemu_set_fd_handler(s->fd, NULL, NULL, NULL);
 +net_socket_read_poll(s, false);
 +net_socket_write_poll(s, false);
  if (s->listen_fd != -1) {
  qemu_set_fd_handler(s->listen_fd, net_socket_accept, NULL, s);
  }
 @@ -152,7 +195,8 @@ static void net_socket_send_dgram(void *opaque)
  return;
  if (size == 0) {
  /* end of connection */
 -qemu_set_fd_handler(s->fd, NULL, NULL, NULL);
 +net_socket_read_poll(s, false);
 +net_socket_write_poll(s, false);
  return;
  }
  qemu_send_packet(&s->nc, s->buf, size);
 @@ -243,7 +287,8 @@ static void net_socket_cleanup(NetClientState *nc)
  {
  NetSocketState *s = DO_UPCAST(NetSocketState, nc, nc);
  if (s-fd != -1) {
 -qemu_set_fd_handler(s-fd, NULL, NULL, NULL);
 +net_socket_read_poll(s, false);
 +net_socket_write_poll(s, false);
  close(s-fd);
  s-fd = -1;
  }
 @@ -314,8 +359,8 @@ static NetSocketState 
 *net_socket_fd_init_dgram(NetClientState *peer,

  s->fd = fd;
  s->listen_fd = -1;
 -
 -qemu_set_fd_handler(s->fd, net_socket_send_dgram, NULL, s);
 +s->send_fn = net_socket_send_dgram;
 +net_socket_read_poll(s, true);

  /* mcast: save bound address as dst */
  if (is_connected) {
 @@ -332,7 +377,8 @@ err:
  static void net_socket_connect(void *opaque)
  {
  NetSocketState *s = opaque;
 -qemu_set_fd_handler(s->fd, net_socket_send, NULL, s);
 +s->send_fn = net_socket_send;
 +net_socket_read_poll(s, true);
  }

  static NetClientInfo net_socket_info = {
 --
 1.7.10.4





Re: [Qemu-devel] [PATCH RFT 0/3] iscsi: fix NULL dereferences / races between task completion and abort

2012-08-20 Thread ronnie sahlberg
On Mon, Aug 20, 2012 at 6:12 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Hi Ronnie,

 Am 20.08.2012 10:08, schrieb Paolo Bonzini:

 That's because the big QEMU lock is held by the thread that called
 qemu_aio_cancel.

 and i also see
 no cancellation message in kernel log.


 And that's because the UNMAP actually ultimately succeeds.  You'll
 probably see soft lockup messages though.

 The solution here is to bump the timeout of the UNMAP command (either in
 the kernel or in libiscsi, I didn't really understand who's at fault).


 What's your suggestion / idea about that?


There are no timeouts in libiscsi itself.
But you can probably tweak it through the guest kernel :


$ cat /sys/bus/scsi/devices/0\:0\:0\:0/timeout
30

should be the default scsi timeout for this device, and

$ cat /sys/bus/scsi/devices/0\:0\:0\:0/queue_depth
31

would be how many concurrent i/o that the guest will drive to the device.


When performing the UNMAPs, maybe set the timeout to something really
big, and at the same time also drop the queue depth to something
really small so that the guest kernel will not be able to send so many
UNMAPs concurrently.


ronnie s



[Qemu-devel] [PATCH] iSCSI: Add support for SG_IO to bdrv_ioctl()/iscsi_ioctl()

2012-08-20 Thread Ronnie Sahlberg
Paolo, List

Please find a patch that adds emulation of SG_IO to the synchronous function
bdrv_ioctl()/iscsi_ioctl().
Previously we have only supported emulation for this ioctl in iSCSI for the
asynchronous function iscsi_aio_ioctl() since that is the only place 
scsi-generic uses SG_IO.

By adding support for SG_IO to iscsi_ioctl() this makes scsi-block and 
scsi-disk work too.


Since scsi-block/scsi-disk never worked with iscsi, and only scsi-generic 
worked, this is not a new regression. So whether this should go in now or wait 
until post 1.2 is at your discretion.

regards
ronnie sahlberg





[Qemu-devel] [PATCH] iSCSI: Add support for SG_IO in bdrv_ioctl()

2012-08-20 Thread Ronnie Sahlberg
We need to support SG_IO in the synchronous bdrv_ioctl() since this
is used by scsi-block

Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
---
 block/iscsi.c |  109 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 108 insertions(+), 1 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 993a86d..9e98bfe 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -548,7 +548,8 @@ iscsi_aio_ioctl_cb(struct iscsi_context *iscsi, int status,

 #define SG_ERR_DRIVER_SENSE    0x08

-    if (status == SCSI_STATUS_CHECK_CONDITION && acb->task->datain.size >= 2) {
+    if (status == SCSI_STATUS_CHECK_CONDITION
+        && acb->task->datain.size >= 2) {
         int ss;

         acb->ioh->driver_status |= SG_ERR_DRIVER_SENSE;
@@ -633,9 +634,53 @@ static BlockDriverAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
     return acb->common;
 }
 
+struct IoctlTask {
+    int status;
+    int complete;
+    sg_io_hdr_t *ioh;
+    struct scsi_task *task;
+};
+
+static void
+iscsi_ioctl_cb(struct iscsi_context *iscsi, int status,
+               void *command_data, void *opaque)
+{
+    struct IoctlTask *itask = opaque;
+
+    if (status < 0) {
+        error_report("Failed to ioctl(SG_IO) to iSCSI lun. %s",
+                     iscsi_get_error(iscsi));
+        itask->status = -EIO;
+    }
+
+    itask->ioh->driver_status = 0;
+    itask->ioh->host_status   = 0;
+    itask->ioh->resid = 0;
+
+#define SG_ERR_DRIVER_SENSE    0x08
+
+    if (status == SCSI_STATUS_CHECK_CONDITION
+        && itask->task->datain.size >= 2) {
+        int ss;
+
+        itask->ioh->driver_status |= SG_ERR_DRIVER_SENSE;
+
+        itask->ioh->sb_len_wr = itask->task->datain.size - 2;
+        ss = (itask->ioh->mx_sb_len <= itask->ioh->sb_len_wr) ?
+             itask->ioh->mx_sb_len : itask->ioh->sb_len_wr;
+        memcpy(itask->ioh->sbp, &itask->task->datain.data[2], ss);
+    }
+
+    itask->complete = 1;
+}
+
 static int iscsi_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
 {
     IscsiLun *iscsilun = bs->opaque;
+    struct iscsi_context *iscsi = iscsilun->iscsi;
+    struct IoctlTask itask;
+    struct scsi_task *task;
+    struct iscsi_data data;

     switch (req) {
     case SG_GET_VERSION_NUM:
@@ -644,6 +689,68 @@ static int iscsi_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
     case SG_GET_SCSI_ID:
         ((struct sg_scsi_id *)buf)->scsi_type = iscsilun->type;
         break;
+    case SG_IO:
+        itask.ioh = buf;
+        task = malloc(sizeof(struct scsi_task));
+        if (task == NULL) {
+            error_report("iSCSI: Failed to allocate task for scsi command. %s",
+                         iscsi_get_error(iscsi));
+            return -1;
+        }
+        memset(task, 0, sizeof(struct scsi_task));
+
+        switch (itask.ioh->dxfer_direction) {
+        case SG_DXFER_TO_DEV:
+            task->xfer_dir = SCSI_XFER_WRITE;
+            break;
+        case SG_DXFER_FROM_DEV:
+            task->xfer_dir = SCSI_XFER_READ;
+            break;
+        default:
+            task->xfer_dir = SCSI_XFER_NONE;
+            break;
+        }
+        task->cdb_size = itask.ioh->cmd_len;
+        memcpy(&task->cdb[0], itask.ioh->cmdp, itask.ioh->cmd_len);
+        task->expxferlen = itask.ioh->dxfer_len;
+
+        if (task->xfer_dir == SCSI_XFER_WRITE) {
+            data.data = itask.ioh->dxferp;
+            data.size = itask.ioh->dxfer_len;
+        }
+
+        if (iscsi_scsi_command_async(iscsi, iscsilun->lun, task,
+                                     iscsi_ioctl_cb,
+                                     (task->xfer_dir == SCSI_XFER_WRITE) ?
+                                     &data : NULL,
+                                     &itask) != 0) {
+            scsi_free_scsi_task(task);
+            return -1;
+        }
+
+        /* tell libiscsi to read straight into the buffer we got from ioctl */
+        if (task->xfer_dir == SCSI_XFER_READ) {
+            scsi_task_add_data_in_buffer(task,
+                                         itask.ioh->dxfer_len,
+                                         itask.ioh->dxferp);
+        }
+
+        itask.complete = 0;
+        itask.status   = 0;
+        itask.task = task;
+        while (!itask.complete) {
+            iscsi_set_events(iscsilun);
+            qemu_aio_wait();
+        }
+        scsi_free_scsi_task(task);
+
+        if (itask.status != 0) {
+            error_report("iSCSI: Failed to send async command to target : %s",
+                         iscsi_get_error(iscsi));
+            return -1;
+        }
+
+        return 0;
     default:
         return -1;
     }
-- 
1.7.3.1
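
As an aside, callers reach this emulation by filling in a Linux sg_io_hdr_t
exactly as they would for a real /dev/sg device; the synchronous path then
spins in qemu_aio_wait() until iscsi_ioctl_cb() marks the task complete. A
minimal, hypothetical sketch of driving the new path with a 6-byte INQUIRY
(the helper name and all values are illustrative, not from the patch, and it
assumes qemu's bdrv_ioctl() of that era plus Linux scsi/sg.h):

/* Sketch only: how a caller might exercise the emulated SG_IO path. */
#include <string.h>
#include <scsi/sg.h>
#include "block.h"          /* for BlockDriverState / bdrv_ioctl() */

static int inquiry_via_sg_io(BlockDriverState *bs)
{
    unsigned char cdb[6]  = { 0x12, 0, 0, 0, 96, 0 };  /* INQUIRY, 96 bytes */
    unsigned char data[96];
    unsigned char sense[32];
    sg_io_hdr_t ioh;

    memset(&ioh, 0, sizeof(ioh));
    ioh.interface_id    = 'S';
    ioh.dxfer_direction = SG_DXFER_FROM_DEV;   /* mapped to SCSI_XFER_READ */
    ioh.cmd_len         = sizeof(cdb);
    ioh.cmdp            = cdb;
    ioh.dxfer_len       = sizeof(data);        /* libiscsi reads into dxferp */
    ioh.dxferp          = data;
    ioh.mx_sb_len       = sizeof(sense);       /* room for returned sense */
    ioh.sbp             = sense;

    /* Blocks in the qemu_aio_wait() loop until the command completes. */
    return bdrv_ioctl(bs, SG_IO, &ioh);
}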




Re: [Qemu-devel] [PATCH 2/2] ISCSI: Force scsi-generic for MMC with blank disks

2012-08-18 Thread ronnie sahlberg
On Sun, Aug 19, 2012 at 7:58 AM, Paolo Bonzini pbonz...@redhat.com wrote:
> Il 17/08/2012 04:36, Ronnie Sahlberg ha scritto:
> > There is no bdrv_* API for the commands for burning a blank MMC disk.
> > So when an iSCSI LUN is specified and the LUN is an MMC device with
> > 0 available blocks, this is a blank disk: force scsi-generic.
> >
> > This allows the guest to talk directly to the target to burn data on
> > the disk.
> >
> > Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
>
> What happens without the patch?  It's ok that scsi-{hd,cd} does not
> work, but do scsi-{block,generic} work without the patch?


Neither of them works, basically because in
block.c:find_image_format(),

if bs->sg is not set in

    if (bs->sg || !bdrv_is_inserted(bs)) {

then we continue to

    ret = bdrv_pread(bs, 0, buf, sizeof(buf));

which fails with an error.
So this patch is basically there to prevent find_image_format() from trying
to read from a blank disk.

Maybe there is a better way to do this?
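
(The forcing itself is in the patch under discussion, which is not quoted in
this thread; assuming iscsi_open() already knows the LUN type and the block
count from READCAPACITY, it plausibly boils down to something like the
following hypothetical reconstruction, not the literal patch text:)

/* Mark blank MMC LUNs as scsi-generic so that find_image_format()
 * takes the bs->sg branch and never attempts the bdrv_pread() probe. */
if (iscsilun->type == TYPE_ROM && iscsilun->num_blocks == 0) {
    bs->sg = 1;
}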



regards
ronnie sahlberg



Re: [Qemu-devel] [PATCH 2/2] ISCSI: Force scsi-generic for MMC with blank disks

2012-08-18 Thread ronnie sahlberg
Ah,

This patch only affects the case where a blank / empty disk is loaded.
It has no effect when real *.iso images are loaded and the disk
contains data.

The use case of being able to burn to an iSCSI CD-ROM is probably not
very urgent, so maybe it is best to delay this until post 1.2.


regards
ronnie sahlberg


On Sun, Aug 19, 2012 at 8:02 AM, ronnie sahlberg
ronniesahlb...@gmail.com wrote:
> On Sun, Aug 19, 2012 at 7:58 AM, Paolo Bonzini pbonz...@redhat.com wrote:
> > Il 17/08/2012 04:36, Ronnie Sahlberg ha scritto:
> > > There is no bdrv_* API for the commands for burning a blank MMC disk.
> > > So when an iSCSI LUN is specified and the LUN is an MMC device with
> > > 0 available blocks, this is a blank disk: force scsi-generic.
> > >
> > > This allows the guest to talk directly to the target to burn data on
> > > the disk.
> > >
> > > Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
> >
> > What happens without the patch?  It's ok that scsi-{hd,cd} does not
> > work, but do scsi-{block,generic} work without the patch?
>
>
> Neither of them works, basically because in
> block.c:find_image_format(),
>
> if bs->sg is not set in
>
>     if (bs->sg || !bdrv_is_inserted(bs)) {
>
> then we continue to
>
>     ret = bdrv_pread(bs, 0, buf, sizeof(buf));
>
> which fails with an error.
> So this patch is basically there to prevent find_image_format() from trying
> to read from a blank disk.
>
> Maybe there is a better way to do this?
>
>
> regards
> ronnie sahlberg



Re: [Qemu-devel] [PATCH 1/2] ISCSI: Set number of blocks to 0 for blank CDROM devices

2012-08-18 Thread ronnie sahlberg
On Sun, Aug 19, 2012 at 7:57 AM, Paolo Bonzini pbonz...@redhat.com wrote:
> Il 17/08/2012 04:36, Ronnie Sahlberg ha scritto:
> > The number of blocks of the device is used to compute the device size
> > in bdrv_getlength()/iscsi_getlength().
> > For MMC devices, the ReturnedLogicalBlockAddress in the READCAPACITY10
> > data has a special meaning when it is 0.
> > In this case it does not mean that LBA 0 is the last accessible LBA,
> > and thus the device has 1 readable block, but instead it means that the
> > disc is blank and there are no readable blocks.
> >
> > This change ensures that when the iSCSI LUN is loaded with a blank
> > DVD-R disc or similar, bdrv_getlength() will return the correct
> > size of the device as 0 bytes.
> >
> > Signed-off-by: Ronnie Sahlberg ronniesahlb...@gmail.com
> > ---
> >  block/iscsi.c |    7 ++++++-
> >  1 files changed, 6 insertions(+), 1 deletions(-)
> >
> > diff --git a/block/iscsi.c b/block/iscsi.c
> > index bb9cf82..fb420ea 100644
> > --- a/block/iscsi.c
> > +++ b/block/iscsi.c
> > @@ -717,7 +717,12 @@ iscsi_readcapacity10_cb(struct iscsi_context *iscsi, int status,
> >      }
> >
> >      itask->iscsilun->block_size = rc10->block_size;
> > -    itask->iscsilun->num_blocks = rc10->lba + 1;
> > +    if (rc10->lba == 0) {
> > +        /* blank disk loaded */
> > +        itask->iscsilun->num_blocks = 0;
> > +    } else {
> > +        itask->iscsilun->num_blocks = rc10->lba + 1;
> > +    }
> >      itask->bs->total_sectors    = itask->iscsilun->num_blocks *
> >                                    itask->iscsilun->block_size / BDRV_SECTOR_SIZE;
>
> Thanks, applied to SCSI branch.  We'll see whether this hits 1.2.

This patch only matters for the use case of having empty/blank media
loaded into an iSCSI CD-ROM.
I doubt being able to burn data to an iSCSI drive is very important or urgent,
so it is fine with me to delay it until post 1.2.
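
To make the off-by-one semantics concrete (illustrative numbers): a pressed
disc reporting rc10->lba == 0x1ffff with a 2048-byte block size yields
num_blocks == 0x20000, so bdrv_getlength() reports 131072 * 2048 =
268435456 bytes (256 MiB). Before this patch, a blank disc (rc10->lba == 0)
would have been reported as a single 2048-byte block instead of 0 bytes.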

regards
ronnie sahlberg


