Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?

2013-07-22 Thread Mark Cave-Ayland

On 18/07/13 14:44, Alexander Graf wrote:


What I would do, however, is to complete even the INPUT/OUTPUT_MORE
commands only at the end of the whole request. This is definitely
allowed behaviour, and it ensures that a memory region isn't already
reused by the OS while e.g. a write request is still running and taking
data from this memory. We should complete the DMA command only once we
no longer touch the memory.


Yes, that's the version that I described as throw away almost all of
today's code and rewrite it :). Keep in mind that the same DMA
controller can be used for Ethernet, so coupling it very tightly with
IDE doesn't sound overly appealing to me either.

I had a go at implementing this yesterday afternoon and managed to come
up with something that boots both Linux and Darwin. I've uploaded it to
github so you can take a look here:
https://github.com/mcayland/qemu/tree/mca-macio.


On the plus side, the code seems a lot more readable; but on the 
downside my heisenbug is *still* present, even in this version of the 
code (see more below). Does the revised version look good enough to you 
both? If so, I'll tidy it up a bit further and post it to the qemu-devel 
list.


In order to ensure that things aren't too tightly coupled to IDE, I've 
introduced a new ready() function pointer which is called from the 
modified DBDMA code. Even though it's currently not used for anything 
other than IDE in QEMU, I think this would enable the same controller to 
be used with Ethernet if required.
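
To give you an idea of the shape without digging through the branch, the
hook looks roughly like this. This is a simplified sketch of the idea
rather than the literal patch; the names are approximate, and I've left
out the irq/flush parameters that the real registration carries:

/* Simplified sketch of the decoupling, not the literal patch. The
 * DBDMA channel knows nothing about IDE: it just hands the registered
 * device an opaque io descriptor plus a completion callback. */

#include <stdint.h>

typedef struct DBDMA_io DBDMA_io;

struct DBDMA_io {
    void     *opaque;                 /* owning device state */
    uint32_t  addr;                   /* guest address of current chunk */
    int       len;                    /* bytes left in this chunk */
    void    (*dma_end)(DBDMA_io *io); /* channel calls this on completion */
};

/* Hooks a device registers per channel. ready() is what the modified
 * DBDMA engine calls when a channel has runnable descriptors, so that
 * the device (IDE today, Ethernet later) can start moving data. */
typedef void (*DBDMA_rw)(DBDMA_io *io);
typedef void (*DBDMA_ready)(DBDMA_io *io);

void DBDMA_register_channel(void *dbdma, int nchan,
                            DBDMA_rw rw, DBDMA_ready ready, void *opaque);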


A couple of stylistic points for the patch:

1) Can I swap m->aiocb for s->bus->dma->aiocb? This seems to be the
field used by the internal IDE core.


2) Should the bdrv_acct_start() calls be removed from 
pmac_ide_transfer()? It looks as if they are already handled in 
ide_sector_start_dma() and ide_atapi_cmd_read_dma().


Now back to the heisenbug: for Kevin's benefit, the reason I started
looking at this is that with Alex's existing patches, Aurelien's Linux
PPC test image fails to boot with DMA timeout errors (see
http://www.ilande.co.uk/tmp/bad-ppc-linux.png). Sadly, if I try to
debug it (by compiling with -O0 -g or adding debug printf() statements)
then everything works as normal :/


On the plus side, with my revised version I can trigger the bug fairly
easily by commenting out the DBDMA_kick() call in ide_dbdma_start() and
enabling DEBUG_IDE_ATAPI in hw/ide/internal.h. Note that while the
timeout occurs on the CDROM most of the time, I've still seen it very
occasionally on the HD device too.


With DEBUG_IDE_ATAPI enabled, the tail of the debug output that I can
get without the bug disappearing looks like this:


reply: tx_size=36 elem_tx_size=0 index=0
byte_count_limit=36
status=0x58
reply: tx_size=0 elem_tx_size=0 index=36
status=0x50
ATAPI limit=0x8 packet: 46 00 00 00 00 00 00 00 08 00 00 00
reply: tx_size=8 elem_tx_size=0 index=0
byte_count_limit=8
status=0x58
reply: tx_size=0 elem_tx_size=0 index=8
status=0x50
ATAPI limit=0x14 packet: 46 01 00 00 00 00 00 00 14 00 00 00
reply: tx_size=20 elem_tx_size=0 index=0
byte_count_limit=20
status=0x58
reply: tx_size=0 elem_tx_size=0 index=20
status=0x50
ATAPI limit=0xc packet: 43 00 00 00 00 00 01 00 0c 00 00 00
reply: tx_size=12 elem_tx_size=0 index=0
byte_count_limit=12
status=0x58
reply: tx_size=0 elem_tx_size=0 index=12
status=0x50
ATAPI limit=0x14 packet: 43 00 00 00 00 00 01 00 14 00 00 00
reply: tx_size=20 elem_tx_size=0 index=0
byte_count_limit=20
status=0x58
reply: tx_size=0 elem_tx_size=0 index=20
status=0x50
ATAPI limit=0xc packet: 43 00 01 00 00 00 00 00 0c 00 00 00
reply: tx_size=12 elem_tx_size=0 index=0
byte_count_limit=12
status=0x58
reply: tx_size=0 elem_tx_size=0 index=12
status=0x50
ATAPI limit=0x20 packet: 51 00 00 00 00 00 00 00 20 00 00 00
Data ready for CD device: 20 bytes

When the bug appears, the guest doesn't program the DMA controller to
finish the transfer, which makes me wonder whether one of the IDE
status flags isn't being cleared correctly. Sadly, if I enable
DEBUG_IDE then of course the delay it introduces makes everything work
again :/


Interestingly enough, the final 0x51 command packet above has s->lba
set to -1. Given that the code tries to use s->lba as the DMA sector
number, this is definitely a bug - what should we actually be doing in
this case?



ATB,

Mark.



Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?

2013-07-18 Thread Kevin Wolf
On 17.07.2013 at 22:12, Mark Cave-Ayland wrote:
 On 17/07/13 14:35, Kevin Wolf wrote:
 
 Okay, so I've had a quick look at that DMA controller, and it seems that
 for a complete emulation, there's no way around using a bounce buffer
 (and calling directly into the block layer instead of using
 dma-helpers.c) for the general case.
 
 You can have a fast path that is triggered if one or more directly
 following INPUT/OUTPUT commands cover the whole IDE command, and that
 creates a QEMUSGList as described above and uses dma-helpers.c to
 implement zero-copy requests. I suspect that your Darwin requests would
 actually fall into this category.
 
 Essentially I think Alex' patches are doing something similar, just not
 implementing the complete DMA controller feature set and with the
 regular slow path hacked as additional code into the fast path. So the
 code could be cleaner, it could use asynchronous block layer functions
 and handle errors, and it could be more complete, but at the end of
 the day you'd still have some fast-path zero-copy I/O and some calls
 into the block layer using bounce buffers.
 
 I think the key concept to understand here is: at what point does the
 data hit the disk? From the comments in various parts of
 Darwin/Linux, it could be understood that the DMA transfers are
 between memory and the ATA drive *buffer*, so for writes especially
 there is no guarantee that they even hit the disk until some point
 in the future, unless of course the FLUSH flag is set in the control
 register.
 
 So part of me thinks that maybe we are over-thinking this
 and we should just go with Kevin's original suggestion: what about
 if we start a new QEMUSGList for each IDE transfer, and just keep
 appending QEMUSGList entries until we find an OUTPUT_LAST/INPUT_LAST
 command?
 
 Why is this valid? We can respond with a complete status for the
 intermediate INPUT_MORE/OUTPUT_MORE commands without touching the
 disk because all that guarantees is that data has been passed
 between memory and the drive *buffer* - not that it has actually hit
 the disk. And what is the point of having explicit _LAST commands if
 they aren't used to signify completion of the whole transfer between
 drive and memory?

I don't think there is even a clear relation between the DMA controller
status and whether the data is on disk or not. It's the IDE registers'
job to tell the driver when a request has completed. The DMA controller
is only responsible for getting the data from RAM to the device, which
might start doing a write only after it has received all data and
completed the DMA operation. (cf. the PIO case in core.c, where the
bytes are gathered in a bounce buffer and the whole sector is written
out only when the last byte arrives)
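
Schematically, that PIO path is this pattern (paraphrased from memory,
not the literal core.c code):

#include <stdint.h>

#define SECTOR_SIZE 512

typedef struct {
    uint8_t buf[SECTOR_SIZE];   /* sector-sized bounce buffer */
    int     index;              /* bytes gathered so far */
} PioSector;

/* Stand-in for the real block-layer write. */
void write_sector(const uint8_t *buf);

static void pio_write_byte(PioSector *s, uint8_t byte)
{
    s->buf[s->index++] = byte;
    if (s->index == SECTOR_SIZE) {
        /* Only when the last byte of the sector has arrived does
         * anything go to the block layer. */
        write_sector(s->buf);
        s->index = 0;
    }
}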

What I would do, however, is to complete even the INPUT/OUTPUT_MORE
commands only at the end of the whole request. This is definitely
allowed behaviour, and it ensures that a memory region isn't already
reused by the OS while e.g. a write request is still running and taking
data from this memory. We should complete the DMA command only once we
no longer touch the memory.
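
In made-up code, the completion policy I mean is roughly:

#include <stdint.h>

#define MAX_DESCS 64

typedef struct {
    uint32_t desc_addr[MAX_DESCS];  /* guest addresses of the parsed
                                       descriptors, _MORE ones included */
    int      ndescs;
} DMARequest;

/* Stand-in: writes xfer_status/res_count back to guest memory. */
void dbdma_complete_desc(uint32_t desc_addr);

/* Block-layer completion callback for the whole IDE command. Only
 * here do we complete the descriptors, so the guest cannot reuse the
 * memory while e.g. a write is still reading from it. */
static void dma_request_cb(void *opaque, int ret)
{
    DMARequest *req = opaque;
    int i;

    (void)ret;  /* error handling elided in this sketch */

    for (i = 0; i < req->ndescs; i++) {
        dbdma_complete_desc(req->desc_addr[i]);
    }
}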

Kevin



Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?

2013-07-18 Thread Alexander Graf

On 18.07.2013, at 09:41, Kevin Wolf wrote:

 On 17.07.2013 at 22:12, Mark Cave-Ayland wrote:
 On 17/07/13 14:35, Kevin Wolf wrote:
 
 Okay, so I've had a quick look at that DMA controller, and it seems that
 for a complete emulation, there's no way around using a bounce buffer
 (and calling directly into the block layer instead of using
 dma-helpers.c) for the general case.
 
 You can have a fast path that is triggered if one or more directly
 following INPUT/OUTPUT commands cover the whole IDE command, and that
 creates a QEMUSGList as described above and uses dma-helpers.c to
 implement zero-copy requests. I suspect that your Darwin requests would
 actually fall into this category.
 
 Essentially I think Alex' patches are doing something similar, just not
 implementing the complete DMA controller feature set and with the
 regular slow path hacked as additional code into the fast path. So the
 code could be cleaner, it could use asynchronous block layer functions
 and handle errors, and it could be more complete, but at the end of
 the day you'd still have some fast-path zero-copy I/O and some calls
 into the block layer using bounce buffers.
 
 I think the key concept to understand here is: at what point does the
 data hit the disk? From the comments in various parts of
 Darwin/Linux, it could be understood that the DMA transfers are
 between memory and the ATA drive *buffer*, so for writes especially
 there is no guarantee that they even hit the disk until some point
 in the future, unless of course the FLUSH flag is set in the control
 register.
 
 So part of me thinks that maybe we are over-thinking this
 and we should just go with Kevin's original suggestion: what about
 if we start a new QEMUSGList for each IDE transfer, and just keep
 appending QEMUSGList entries until we find an OUTPUT_LAST/INPUT_LAST
 command?
 
 Why is this valid? We can respond with a complete status for the
 intermediate INPUT_MORE/OUTPUT_MORE commands without touching the
 disk because all that guarantees is that data has been passed
 between memory and the drive *buffer* - not that it has actually hit
 the disk. And what is the point of having explicit _LAST commands if
 they aren't used to signify completion of the whole transfer between
 drive and memory?
 
 I don't think there is even a clear relation between the DMA controller
 status and whether the data is on disk or not. It's the IDE registers'
 job to tell the driver when a request has completed. The DMA controller
 is only responsible for getting the data from RAM to the device, which
 might start doing a write only after it has received all data and
 completed the DMA operation. (cf. the PIO case in core.c, where the
 bytes are gathered in a bounce buffer and the whole sector is written
 out only when the last byte arrives)
 
 What I would do, however, is to complete even the INPUT/OUTPUT_MORE
 commands only at the end of the whole request. This is definitely
 allowed behaviour, and it ensures that a memory region isn't already
 reused by the OS while e.g. a write request is still running and taking
 data from this memory. We should complete the DMA command only once we
 no longer touch the memory.

Yes, that's the version that I described as throw away almost all of today's 
code and rewrite it :). Keep in mind that the same DMA controller can be used 
for Ethernet, so coupling it very tightly with IDE doesn't sound overly 
appealing to me either.


Alex




Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?

2013-07-18 Thread Kevin Wolf
On 18.07.2013 at 15:44, Alexander Graf wrote:
 
 On 18.07.2013, at 09:41, Kevin Wolf wrote:
 
  On 17.07.2013 at 22:12, Mark Cave-Ayland wrote:
  On 17/07/13 14:35, Kevin Wolf wrote:
  
  Okay, so I've had a quick look at that DMA controller, and it seems that
  for a complete emulation, there's no way around using a bounce buffer
  (and calling directly into the block layer instead of using
  dma-helpers.c) for the general case.
  
  You can have a fast path that is triggered if one or more directly
  following INPUT/OUTPUT commands cover the whole IDE command, and that
  creates a QEMUSGList as described above and uses dma-helpers.c to
  implement zero-copy requests. I suspect that your Darwin requests would
  actually fall into this category.
  
  Essentially I think Alex' patches are doing something similar, just not
  implementing the complete DMA controller feature set and with the
  regular slow path hacked as additional code into the fast path. So the
  code could be cleaner, it could use asynchronous block layer functions
  and handle errors, and it could be more complete, but at the end of
  the day you'd still have some fast-path zero-copy I/O and some calls
  into the block layer using bounce buffers.
  
  I think the key concept to understand here is: at what point does the
  data hit the disk? From the comments in various parts of
  Darwin/Linux, it could be understood that the DMA transfers are
  between memory and the ATA drive *buffer*, so for writes especially
  there is no guarantee that they even hit the disk until some point
  in the future, unless of course the FLUSH flag is set in the control
  register.
  
  So part of me thinks that maybe we are over-thinking this
  and we should just go with Kevin's original suggestion: what about
  if we start a new QEMUSGList for each IDE transfer, and just keep
  appending QEMUSGList entries until we find an OUTPUT_LAST/INPUT_LAST
  command?
  
  Why is this valid? We can respond with a complete status for the
  intermediate INPUT_MORE/OUTPUT_MORE commands without touching the
  disk because all that guarantees is that data has been passed
  between memory and the drive *buffer* - not that it has actually hit
  the disk. And what is the point of having explicit _LAST commands if
  they aren't used to signify completion of the whole transfer between
  drive and memory?
  
  I don't think there is even a clear relation between the DMA controller
  status and whether the data is on disk or not. It's the IDE registers'
  job to tell the driver when a request has completed. The DMA controller
  is only responsible for getting the data from RAM to the device, which
  might start doing a write only after it has received all data and
  completed the DMA operation. (cf. the PIO case in core.c, where the
  bytes are gathered in a bounce buffer and the whole sector is written
  out only when the last byte arrives)
  
  What I would do, however, is to complete even the INPUT/OUTPUT_MORE
  commands only at the end of the whole request. This is definitely
  allowed behaviour, and it ensures that a memory region isn't already
  reused by the OS while e.g. a write request is still running and taking
  data from this memory. We should complete the DMA command only once we
  no longer touch the memory.
 
 Yes, that's the version that I described as throw away almost all of
 today's code and rewrite it :).

Well, Mark didn't ask me what's easy to implement, but what's the Right
Thing to do. :-)

 Keep in mind that the same DMA controller can be used for Ethernet, so
 coupling it very tightly with IDE doesn't sound overly appealing to me
 either.

Taking this into consideration will make it even harder, but of course,
once designed right, it's also the most useful approach.

Kevin



Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?

2013-07-17 Thread Mark Cave-Ayland

On 17/07/13 09:16, Kevin Wolf wrote:

Hi Kevin,

Thanks for the reply - CC to qemu-devel as requested.


I've been testing some of Alex Graf's patches for running Darwin
under QEMU PPC and have been experiencing some timeout problems on
block devices. My attention is drawn to this commit in particular:
https://github.com/qemu/qemu/commit/80fc95d8bdaf3392106b131a97ca701fd374489a.

The reason for this commit is that Darwin programs the DBDMA
controller to transfer data from the ATA FIFO in chunks that aren't
sector aligned, e.g. the ATA command requests 0x100 (256 sectors) but
programs the DMA engine to transfer the data to memory as 3 chunks of
0xfffe, 0xfffe and 0x4 bytes (256 sectors * 512 bytes = 0x20000 bytes
= 0xfffe + 0xfffe + 0x4).


I'm not familiar with how DMA works for the macio IDE device. Do you
have any pointers to specs or something?


It works by setting up a DMA descriptor table (a list of commands)
which is then executed, once the RUN status bit is set, until a STOP
command is reached. Things are slightly more complicated in that
commands can have conditional branches set on them.
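
For reference, each descriptor is a 16-byte record along these lines
(as I remember it from the spec and the QEMU DBDMA code, so take the
exact names with a pinch of salt):

#include <stdint.h>

/* 16-byte DBDMA descriptor as it sits in guest memory (little-endian
 * fields). */
typedef struct dbdma_cmd {
    uint16_t req_count;    /* requested transfer length in bytes */
    uint16_t command;      /* cmd in bits 15:12, plus key/i/b/w bits */
    uint32_t address;      /* physical address of the data buffer */
    uint32_t cmd_dep;      /* command-dependent, e.g. a branch target */
    uint16_t res_count;    /* residual count, written back on completion */
    uint16_t xfer_status;  /* status, written back on completion */
} dbdma_cmd;

/* Command values (bits 15:12 of the command word). */
enum {
    OUTPUT_MORE = 0, OUTPUT_LAST = 1,   /* memory -> device */
    INPUT_MORE  = 2, INPUT_LAST  = 3,   /* device -> memory */
    STORE_QUAD  = 4, LOAD_QUAD   = 5,
    NOP_CMD     = 6, DBDMA_STOP  = 7,   /* STOP halts the channel */
};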



The one important point I'm wondering about is why you call
dma_bdrv_read() with a single 0xfffe QEMUSGList. Shouldn't it really be
called with a QEMUSGList { 0xfffe, 0xfffe, 0x4 }, which should enable
dma-helpers.c to do the right thing?


Hmmm I guess you could perhaps scan down the command list from the 
current position looking for all INPUT/OUTPUT commands until the next 
STOP command, and maybe build up a single QEMUSGList from that? I'm not 
sure exactly how robust that would be with the conditional branching 
though - Alex?
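
Something like this is what I have in mind, as a very rough sketch: it
uses the descriptor layout above plus QEMU's qemu_sglist_add() and
le*_to_cpu() helpers, while dbdma_read_cmd(), the channel field names
and the position of the branch bits are all invented/assumed:

#define BR_MASK 0x000c   /* assumed position of the branch-condition bits */

static bool macio_build_sglist(DBDMA_channel *ch, QEMUSGList *sg)
{
    uint32_t cp = ch->commandptr;        /* invented field name */
    dbdma_cmd cmd;

    for (;;) {
        dbdma_read_cmd(ch, cp, &cmd);    /* invented helper */
        uint16_t cw = le16_to_cpu(cmd.command);

        if (cw & BR_MASK) {
            return false;   /* conditional branch: give up, slow path */
        }

        switch (cw >> 12) {
        case INPUT_MORE: case INPUT_LAST:
        case OUTPUT_MORE: case OUTPUT_LAST:
            qemu_sglist_add(sg, le32_to_cpu(cmd.address),
                            le16_to_cpu(cmd.req_count));
            break;
        case DBDMA_STOP:
            return true;    /* whole list gathered successfully */
        default:
            return false;   /* NOP/LOAD/STORE etc.: slow path */
        }
        cp += sizeof(cmd);
    }
}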



In any case, it would be good if you could prepare a (yet failing) qtest
case that demonstrates how DMA works on this controller and what the
problematic requests look like.


I'll see what I can do; however, I've not really looked at qtest
before, so it could take a little time. In the meantime, you can
easily see these transfers by booting an old Darwin installation ISO.



It seems that the DMA API dma_bdrv_read()/dma_bdrv_write() can't
handle unaligned transfers in this way, yet I think there should be a
solution that doesn't mix DMA and non-DMA APIs in this manner. I'd
like to try and come up with one, but there seems to be a mix of
synchronous/asynchronous/co-routine block APIs that could be used.

So my question is: what do you think is the best way to approach
solving the unaligned data access for MACIO using a DMA-friendly
API?


First, as I said above, I'd like to understand why you need to go with
unaligned values into the DMA API. Real hardware also only works on a
sector level.


The main culprit for these transfers is Darwin, which limits large
transfers to 0xfffe bytes (see
http://searchcode.com/codesearch/view/23337208 line 382). Hence most
large disk transactions get broken down into irregularly-sized chunks,
which highlights this issue.



The block layer is really designed for working with whole sectors. The
only functions dealing with byte-aligned requests are bdrv_pread/pwrite.
They are synchronous (i.e. block the vcpu while running) and writes are
slow (because they first read the whole sector, copy in the modified
part, and write out the whole sector again), so you want to avoid it.


Yeah, I figured this wasn't the most efficient way of doing it. The 
reason for asking the question was that I'm still struggling with some 
kind of timing/threading issue with Alex's work here, and I'm wondering 
if making the unaligned DMA requests more atomic by not mixing 
DMA/non-DMA/synchronous APIs will help solve the issue or not.



ATB,

Mark.



Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?

2013-07-17 Thread Alexander Graf

On 17.07.2013, at 14:52, Mark Cave-Ayland wrote:

 On 17/07/13 09:16, Kevin Wolf wrote:
 
 Hi Kevin,
 
 Thanks for the reply - CC to qemu-devel as requested.
 
 I've been testing some of Alex Graf's patches for running Darwin
 under QEMU PPC and have been experiencing some timeout problems on
 block devices. My attention is drawn to this commit in particular:
 https://github.com/qemu/qemu/commit/80fc95d8bdaf3392106b131a97ca701fd374489a.
 
 The reason for this commit is that Darwin programs the DBDMA
 controller to transfer data from the ATA FIFO in chunks that aren't
 sector aligned, e.g. the ATA command requests 0x100 (256 sectors) but
 programs the DMA engine to transfer the data to memory as 3 chunks of
 0xfffe, 0xfffe and 0x4 bytes.
 
 I'm not familiar with how DMA works for the macio IDE device. Do you
 have any pointers to specs or something?
 
 It works by setting up a DMA descriptor table (a list of commands) which
 is then executed, once the RUN status bit is set, until a STOP command is
 reached. Things are slightly more complicated in that commands can have
 conditional branches set on them.
 
 The one important point I'm wondering about is why you call
 dma_bdrv_read() with a single 0xfffe QEMUSGList. Shouldn't it really be
 called with a QEMUSGList { 0xfffe, 0xfffe, 0x4 }, which should enable
 dma-helpers.c to do the right thing?
 
 Hmmm I guess you could perhaps scan down the command list from the current 
 position looking for all INPUT/OUTPUT commands until the next STOP command, 
 and maybe build up a single QEMUSGList from that? I'm not sure exactly how 
 robust that would be with the conditional branching though - Alex?

It'd at least be vastly different from how real hardware works, yes. We'd 
basically have to throw away the current interpretation code and instead 
emulate the device based on assumptions.

 
 In any case, it would be good if you could prepare a (yet failing) qtest
 case that demonstrates how DMA works on this controller and what the
 problematic requests look like.
 
 I'll see what I can do; however, I've not really looked at qtest before,
 so it could take a little time. In the meantime, you can easily see these
 transfers by booting an old Darwin installation ISO.
 
 It seems that the DMA API dma_bdrv_read()/dma_bdrv_write() can't
 handle unaligned transfers in this way, yet I think there should be a
 solution that doesn't mix DMA and non-DMA APIs in this manner. I'd
 like to try and come up with one, but there seems to be a mix of
 synchronous/asynchronous/co-routine block APIs that could be used.
 
 So my question is: what do you think is the best way to approach
 solving the unaligned data access for MACIO using a DMA-friendly
 API?
 
 First, as I said above, I'd like to understand why you need to go with
 unaligned values into the DMA API. Real hardware also only works on a
 sector level.
 
 The main culprit for these transfers is Darwin, which limits large
 transfers to 0xfffe bytes (see
 http://searchcode.com/codesearch/view/23337208 line 382). Hence most
 large disk transactions get broken down into irregularly-sized chunks,
 which highlights this issue.

The main issue is that we're dealing with two separate pieces of hardware
here. There is the IDE controller, which works at sector level, and then
there's the DMA controller, which fetches data from the IDE controller
byte-wise (from what I understand). Both work independently, but we try
to shoehorn both into the same callback.

 
 The block layer is really designed for working with whole sectors. The
 only functions dealing with byte-aligned requests are bdrv_pread/pwrite.
 They are synchronous (i.e. block the vcpu while running) and writes are
 slow (because they first read the whole sector, copy in the modified
 part, and write out the whole sector again), so you want to avoid it.
 
 Yeah, I figured this wasn't the most efficient way of doing it. The reason 
 for asking the question was that I'm still struggling with some kind of 
 timing/threading issue with Alex's work here, and I'm wondering if making the 
 unaligned DMA requests more atomic by not mixing DMA/non-DMA/synchronous 
 APIs will help solve the issue or not.

I don't think it'll make a difference. But I'd be more than happy to design 
this more properly too - the current code is vastly ugly.


Alex




Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?

2013-07-17 Thread Mark Cave-Ayland

On 17/07/13 14:35, Kevin Wolf wrote:


Okay, so I've had a quick look at that DMA controller, and it seems that
for a complete emulation, there's no way around using a bounce buffer
(and calling directly into the block layer instead of using
dma-helpers.c) for the general case.

You can have a fast path that is triggered if one or more directly
following INPUT/OUTPUT commands cover the whole IDE command, and that
creates a QEMUSGList as described above and uses dma-helpers.c to
implement zero-copy requests. I suspect that your Darwin requests would
actually fall into this category.

Essentially I think Alex' patches are doing something similar, just not
implementing the complete DMA controller feature set and with the
regular slow path hacked as additional code into the fast path. So the
code could be cleaner, it could use asynchronous block layer functions
and handle errors, and it could be more complete, but at the end of
the day you'd still have some fast-path zero-copy I/O and some calls
into the block layer using bounce buffers.


I think the key concept to understand here is: at what point does the
data hit the disk? From the comments in various parts of Darwin/Linux, 
it could be understood that the DMA transfers are between memory and the 
ATA drive *buffer*, so for writes especially there is no guarantee that 
they even hit the disk until some point in the future, unless of course 
the FLUSH flag is set in the control register.


So part of me thinks that maybe we are over-thinking this and we
should just go with Kevin's original suggestion: what about if we start 
a new QEMUSGList for each IDE transfer, and just keep appending 
QEMUSGList entries until we find an OUTPUT_LAST/INPUT_LAST command?
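
As a sketch (invented/approximate names, untested; dma_bdrv_read() is
the real dma-helpers.c entry point, everything else is made up):

static void macio_dma_cmd(MACIOIDEState *m, IDEState *s, dbdma_cmd *cmd)
{
    int op = le16_to_cpu(cmd->command) >> 12;

    /* Every INPUT/OUTPUT descriptor just appends its buffer to the
     * per-command scatter/gather list. */
    qemu_sglist_add(&m->sg, le32_to_cpu(cmd->address),
                    le16_to_cpu(cmd->req_count));

    if (op == INPUT_LAST || op == OUTPUT_LAST) {
        /* The whole transfer is now described: run it zero-copy, and
         * complete all descriptors (the _MORE ones included) from the
         * callback once the block layer has finished with the memory. */
        m->aiocb = dma_bdrv_read(s->bs, &m->sg, s->sector_num,
                                 pmac_ide_transfer_cb, m);
    }
}

(The write path would look the same with dma_bdrv_write.)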


Why is this valid? We can respond with a complete status for the 
intermediate INPUT_MORE/OUTPUT_MORE commands without touching the disk 
because all that guarantees is that data has been passed between memory 
and the drive *buffer* - not that it has actually hit the disk. And what 
is the point of having explicit _LAST commands if they aren't used to 
signify completion of the whole transfer between drive and memory?



ATB,

Mark.