Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?
On 18/07/13 14:44, Alexander Graf wrote:
>> What I would do, however, is to complete even the INPUT/OUTPUT_MORE commands only at the end of the whole request. This is definitely allowed behaviour, and it ensures that a memory region isn't already reused by the OS while e.g. a write request is still running and taking data from this memory. We should only complete the DMA command as soon as we don't touch the memory any more.
>
> Yes, that's the version that I described as "throw away almost all of today's code and rewrite it" :). Keep in mind that the same DMA controller can be used for Ethernet, so coupling it very tightly with IDE doesn't sound overly appealing to me either.

I had a go at implementing this over yesterday afternoon and managed to come up with something that boots both Linux and Darwin. I've uploaded it to github so you can both take a look at it here: https://github.com/mcayland/qemu/tree/mca-macio. On the plus side, the code seems a lot more readable; but on the downside my heisenbug is *still* present, even in this version of the code (more below). Does the revised version look good enough to you both? If so, I'll tidy it up a bit further and post it to the qemu-devel list.

In order to ensure that things aren't too tightly coupled to IDE, I've introduced a new ready() function pointer which is called from the modified DBDMA code. Even though it's currently not used for anything other than IDE in QEMU, I think this would enable the same controller to be used with Ethernet if required.

A couple of stylistic points for the patch:

1) Can I swap m->aiocb for s->bus->dma->aiocb? This seems to be the field used by the internal IDE core.

2) Should the bdrv_acct_start() calls be removed from pmac_ide_transfer()? It looks as if they are already handled in ide_sector_start_dma() and ide_atapi_cmd_read_dma().
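For concreteness, the shape of that ready() decoupling might look something like the following minimal sketch. All names, fields and signatures here are illustrative, not the actual QEMU API; the point is only that the DBDMA core talks to its client (IDE, Ethernet, ...) through a callback rather than calling IDE code directly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-channel I/O state: the DBDMA engine fills in the
 * current buffer and asks the registered client to move the data. */
typedef struct DBDMA_io DBDMA_io;

struct DBDMA_io {
    void *opaque;                /* client state, e.g. the macio IDE bus */
    uint64_t addr;               /* guest address of the current buffer  */
    uint32_t len;                /* bytes left in the current descriptor */
    void (*ready)(DBDMA_io *io); /* client callback: data can be moved   */
};

/* Called by the DBDMA engine when it reaches an INPUT/OUTPUT command:
 * hand control to the client, which performs the transfer and kicks
 * the channel again when it is done. */
static void dbdma_run_client(DBDMA_io *io)
{
    if (io->ready) {
        io->ready(io);
    }
}

/* Tiny demo client used to show the flow: just counts invocations. */
static int demo_ready_calls;
static void demo_ready(DBDMA_io *io)
{
    (void)io;
    demo_ready_calls++;
}
```

The IDE code would register its callback once at channel setup; an Ethernet model would register a different one against the same DBDMA core.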
Now back to the heisenbug: for Kevin's benefit, the reason I started looking at this is that with Alex's existing patches, Aurelien's test Linux PPC image fails to boot, with DMA timeout errors (see http://www.ilande.co.uk/tmp/bad-ppc-linux.png). Sadly, if I try and debug it (by compiling with -O0 -g or adding debug printf() statements) then everything works as normal :/

On the plus side, with my revised version I can trigger the bug fairly easily by commenting out DBDMA_kick in ide_dbdma_start() and enabling DEBUG_IDE_ATAPI in hw/ide/internal.h. Note that while the timeout occurs on the CDROM the majority of the time, I've still seen it very occasionally on the HD device too. With DEBUG_IDE_ATAPI enabled, the tail of the debug output I can get without the bug disappearing looks like this:

reply: tx_size=36 elem_tx_size=0 index=0
byte_count_limit=36
status=0x58
reply: tx_size=0 elem_tx_size=0 index=36
status=0x50
ATAPI limit=0x8 packet: 46 00 00 00 00 00 00 00 08 00 00 00
reply: tx_size=8 elem_tx_size=0 index=0
byte_count_limit=8
status=0x58
reply: tx_size=0 elem_tx_size=0 index=8
status=0x50
ATAPI limit=0x14 packet: 46 01 00 00 00 00 00 00 14 00 00 00
reply: tx_size=20 elem_tx_size=0 index=0
byte_count_limit=20
status=0x58
reply: tx_size=0 elem_tx_size=0 index=20
status=0x50
ATAPI limit=0xc packet: 43 00 00 00 00 00 01 00 0c 00 00 00
reply: tx_size=12 elem_tx_size=0 index=0
byte_count_limit=12
status=0x58
reply: tx_size=0 elem_tx_size=0 index=12
status=0x50
ATAPI limit=0x14 packet: 43 00 00 00 00 00 01 00 14 00 00 00
reply: tx_size=20 elem_tx_size=0 index=0
byte_count_limit=20
status=0x58
reply: tx_size=0 elem_tx_size=0 index=20
status=0x50
ATAPI limit=0xc packet: 43 00 01 00 00 00 00 00 0c 00 00 00
reply: tx_size=12 elem_tx_size=0 index=0
byte_count_limit=12
status=0x58
reply: tx_size=0 elem_tx_size=0 index=12
status=0x50
ATAPI limit=0x20 packet: 51 00 00 00 00 00 00 00 20 00 00 00
Data ready for CD device: 20 bytes

When the bug appears, the
guest doesn't bother to program the DMA controller to finish the transfer, which makes me wonder if one of the IDE status flags is not being cleared correctly? Sadly, if I enable DEBUG_IDE then of course the delay introduced makes everything work again :/

Interestingly enough, the final 0x51 command packet above has s->lba set to -1. Given that the code tries to use s->lba as the DMA sector number, this is definitely a bug - what should we actually be doing in this case?

ATB,

Mark.
Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?
On 17.07.2013 at 22:12, Mark Cave-Ayland wrote:
> On 17/07/13 14:35, Kevin Wolf wrote:
>> Okay, so I've had a quick look at that DMA controller, and it seems that for a complete emulation, there's no way around using a bounce buffer (and calling directly into the block layer instead of using dma-helpers.c) for the general case. You can have a fast path that is triggered if one or more directly following INPUT/OUTPUT commands cover the whole IDE command, and that creates a QEMUSGList as described above and uses dma-helpers.c to implement zero-copy requests. I suspect that your Darwin requests would actually fall into this category.
>>
>> Essentially I think Alex's patches are doing something similar, just not implementing the complete DMA controller feature set and with the regular slow path hacked as additional code into the fast path. So the code could be cleaner, it could use asynchronous block layer functions and handle errors, and it could be more complete, but at the end of the day you'd still have some fast-path zero-copy I/O and some calls into the block layer using bounce buffers.
>
> I think the key concept to understand here is: at what point does the data hit the disk? From the comments in various parts of Darwin/Linux, it could be understood that the DMA transfers are between memory and the ATA drive *buffer*, so for writes especially there is no guarantee that they even hit the disk until some point in the future, unless of course the FLUSH flag is set in the control register.
>
> So part of me thinks that maybe we are over-thinking this and we should just go with Kevin's original suggestion: what about if we start a new QEMUSGList for each IDE transfer, and just keep appending QEMUSGList entries until we find an OUTPUT_LAST/INPUT_LAST command?
>
> Why is this valid?
> We can respond with a complete status for the intermediate INPUT_MORE/OUTPUT_MORE commands without touching the disk, because all that guarantees is that data has been passed between memory and the drive *buffer* - not that it has actually hit the disk. And what is the point of having explicit _LAST commands if they aren't used to signify completion of the whole transfer between drive and memory?

I don't think there is even a clear relation between the DMA controller status and whether the data is on disk or not. It's the IDE registers' job to tell the driver when a request has completed. The DMA controller is only responsible for getting the data from the RAM to the device, which might start doing a write only after it has received all data and completed the DMA operation. (Cf. PIO operation in core.c, where the bytes are gathered in a bounce buffer and only when the last byte arrives is the whole sector written out.)

What I would do, however, is to complete even the INPUT/OUTPUT_MORE commands only at the end of the whole request. This is definitely allowed behaviour, and it ensures that a memory region isn't already reused by the OS while e.g. a write request is still running and taking data from this memory. We should only complete the DMA command as soon as we don't touch the memory any more.

Kevin
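Kevin's deferred-completion idea could be sketched roughly as follows. The structures and names are purely illustrative toys; in real QEMU the per-command status write-back would go through the DBDMA descriptor's result fields, and the trigger would be the block layer's AIO completion callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

/* Track which descriptors belong to the in-flight request; their
 * completion (status write-back) is deferred until the block layer
 * is guaranteed not to touch the guest memory they describe. */
typedef struct {
    uint32_t pending[MAX_PENDING]; /* descriptor indices awaiting status */
    size_t npending;
} DeferredCompletion;

/* Called while walking INPUT/OUTPUT_MORE commands: remember the
 * descriptor instead of completing it immediately. */
static void defer_command(DeferredCompletion *dc, uint32_t desc_index)
{
    if (dc->npending < MAX_PENDING) {
        dc->pending[dc->npending++] = desc_index;
    }
}

/* Called from the request's completion callback: only now is it safe
 * to mark every deferred command done. Returns how many were retired. */
static size_t complete_all(DeferredCompletion *dc,
                           void (*mark_done)(uint32_t))
{
    size_t n = dc->npending;
    for (size_t i = 0; i < n; i++) {
        mark_done(dc->pending[i]);
    }
    dc->npending = 0;
    return n;
}

/* Demo hook standing in for the descriptor status write-back. */
static size_t demo_done_count;
static void demo_mark_done(uint32_t idx)
{
    (void)idx;
    demo_done_count++;
}
```

The guest still sees every command complete, just batched at the point where the data really is out of its memory.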
Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?
On 18.07.2013, at 09:41, Kevin Wolf wrote:
> On 17.07.2013 at 22:12, Mark Cave-Ayland wrote:
>> On 17/07/13 14:35, Kevin Wolf wrote:
>>> Okay, so I've had a quick look at that DMA controller, and it seems that for a complete emulation, there's no way around using a bounce buffer (and calling directly into the block layer instead of using dma-helpers.c) for the general case. You can have a fast path that is triggered if one or more directly following INPUT/OUTPUT commands cover the whole IDE command, and that creates a QEMUSGList as described above and uses dma-helpers.c to implement zero-copy requests. I suspect that your Darwin requests would actually fall into this category.
>>>
>>> Essentially I think Alex's patches are doing something similar, just not implementing the complete DMA controller feature set and with the regular slow path hacked as additional code into the fast path. So the code could be cleaner, it could use asynchronous block layer functions and handle errors, and it could be more complete, but at the end of the day you'd still have some fast-path zero-copy I/O and some calls into the block layer using bounce buffers.
>>
>> I think the key concept to understand here is: at what point does the data hit the disk? From the comments in various parts of Darwin/Linux, it could be understood that the DMA transfers are between memory and the ATA drive *buffer*, so for writes especially there is no guarantee that they even hit the disk until some point in the future, unless of course the FLUSH flag is set in the control register.
>>
>> So part of me thinks that maybe we are over-thinking this and we should just go with Kevin's original suggestion: what about if we start a new QEMUSGList for each IDE transfer, and just keep appending QEMUSGList entries until we find an OUTPUT_LAST/INPUT_LAST command?
>>
>> Why is this valid?
>> We can respond with a complete status for the intermediate INPUT_MORE/OUTPUT_MORE commands without touching the disk, because all that guarantees is that data has been passed between memory and the drive *buffer* - not that it has actually hit the disk. And what is the point of having explicit _LAST commands if they aren't used to signify completion of the whole transfer between drive and memory?
>
> I don't think there is even a clear relation between the DMA controller status and whether the data is on disk or not. It's the IDE registers' job to tell the driver when a request has completed. The DMA controller is only responsible for getting the data from the RAM to the device, which might start doing a write only after it has received all data and completed the DMA operation. (Cf. PIO operation in core.c, where the bytes are gathered in a bounce buffer and only when the last byte arrives is the whole sector written out.)
>
> What I would do, however, is to complete even the INPUT/OUTPUT_MORE commands only at the end of the whole request. This is definitely allowed behaviour, and it ensures that a memory region isn't already reused by the OS while e.g. a write request is still running and taking data from this memory. We should only complete the DMA command as soon as we don't touch the memory any more.

Yes, that's the version that I described as "throw away almost all of today's code and rewrite it" :). Keep in mind that the same DMA controller can be used for Ethernet, so coupling it very tightly with IDE doesn't sound overly appealing to me either.

Alex
Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?
On 18.07.2013 at 15:44, Alexander Graf wrote:
> On 18.07.2013, at 09:41, Kevin Wolf wrote:
>> On 17.07.2013 at 22:12, Mark Cave-Ayland wrote:
>>> On 17/07/13 14:35, Kevin Wolf wrote:
>>>> Okay, so I've had a quick look at that DMA controller, and it seems that for a complete emulation, there's no way around using a bounce buffer (and calling directly into the block layer instead of using dma-helpers.c) for the general case. You can have a fast path that is triggered if one or more directly following INPUT/OUTPUT commands cover the whole IDE command, and that creates a QEMUSGList as described above and uses dma-helpers.c to implement zero-copy requests. I suspect that your Darwin requests would actually fall into this category.
>>>>
>>>> Essentially I think Alex's patches are doing something similar, just not implementing the complete DMA controller feature set and with the regular slow path hacked as additional code into the fast path. So the code could be cleaner, it could use asynchronous block layer functions and handle errors, and it could be more complete, but at the end of the day you'd still have some fast-path zero-copy I/O and some calls into the block layer using bounce buffers.
>>>
>>> I think the key concept to understand here is: at what point does the data hit the disk? From the comments in various parts of Darwin/Linux, it could be understood that the DMA transfers are between memory and the ATA drive *buffer*, so for writes especially there is no guarantee that they even hit the disk until some point in the future, unless of course the FLUSH flag is set in the control register.
>>>
>>> So part of me thinks that maybe we are over-thinking this and we should just go with Kevin's original suggestion: what about if we start a new QEMUSGList for each IDE transfer, and just keep appending QEMUSGList entries until we find an OUTPUT_LAST/INPUT_LAST command?
>>>
>>> Why is this valid?
>>> We can respond with a complete status for the intermediate INPUT_MORE/OUTPUT_MORE commands without touching the disk, because all that guarantees is that data has been passed between memory and the drive *buffer* - not that it has actually hit the disk. And what is the point of having explicit _LAST commands if they aren't used to signify completion of the whole transfer between drive and memory?
>>
>> I don't think there is even a clear relation between the DMA controller status and whether the data is on disk or not. It's the IDE registers' job to tell the driver when a request has completed. The DMA controller is only responsible for getting the data from the RAM to the device, which might start doing a write only after it has received all data and completed the DMA operation. (Cf. PIO operation in core.c, where the bytes are gathered in a bounce buffer and only when the last byte arrives is the whole sector written out.)
>>
>> What I would do, however, is to complete even the INPUT/OUTPUT_MORE commands only at the end of the whole request. This is definitely allowed behaviour, and it ensures that a memory region isn't already reused by the OS while e.g. a write request is still running and taking data from this memory. We should only complete the DMA command as soon as we don't touch the memory any more.
>
> Yes, that's the version that I described as "throw away almost all of today's code and rewrite it" :).

Well, Mark didn't ask me what's easy to implement, but what's the Right Thing to do. :-)

> Keep in mind that the same DMA controller can be used for Ethernet, so coupling it very tightly with IDE doesn't sound overly appealing to me either.

Taking this into consideration will make it even harder, but of course, once designed right, it's also the most useful approach.

Kevin
Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?
On 17/07/13 09:16, Kevin Wolf wrote:

Hi Kevin,

Thanks for the reply - CC to qemu-devel as requested.

>> I've been testing some of Alex Graf's patches for running Darwin under QEMU PPC and have been experiencing some timeout problems on block devices. My attention is drawn to this commit in particular: https://github.com/qemu/qemu/commit/80fc95d8bdaf3392106b131a97ca701fd374489a. The reason for this commit is that Darwin programs the DBDMA controller to transfer data from the ATA FIFO in chunks that aren't sector aligned, e.g. the ATA command requests 0x100 (256 sectors) but programs the DMA engine to transfer the data to memory as 3 chunks of 0xfffe, 0xfffe and 0x4 bytes.
>
> I'm not familiar with how DMA works for the macio IDE device. Do you have any pointers to specs or something?

It works by setting up a DMA descriptor table (which is a list of commands) which is then executed, once the RUN status bit is set, until a STOP command is reached. Things are slightly more complicated in that commands can have conditional branches set on them.

> The one important point I'm wondering about is why you call dma_bdrv_read() with a single 0xfffe QEMUSGList. Shouldn't it really be called with a QEMUSGList { 0xfffe, 0xfffe, 0x4 }, which should enable dma-helpers.c to do the right thing?

Hmmm, I guess you could perhaps scan down the command list from the current position looking for all INPUT/OUTPUT commands until the next STOP command, and maybe build up a single QEMUSGList from that? I'm not sure exactly how robust that would be with the conditional branching though - Alex?

> In any case, it would be good if you could prepare a (yet failing) qtest case that demonstrates how DMA works on this controller and what the problematic requests look like.

I'll see what I can do, however I've not really looked at qtest before so it could take a little time. In the meantime, you can easily see these transfers by booting an old Darwin installation ISO.
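For reference, a DBDMA channel command is a 16-byte descriptor along these lines. This is a sketch based on the public DBDMA documentation; the field names approximate those in QEMU's mac_dbdma code and should not be read as the exact emulation structures.

```c
#include <assert.h>
#include <stdint.h>

/* DBDMA command opcodes (upper bits of the command word). The channel
 * fetches and executes descriptors while the RUN bit is set, honouring
 * any wait/branch/interrupt conditions, until it reaches STOP. */
enum {
    OUTPUT_MORE = 0, /* memory -> device, more data follows       */
    OUTPUT_LAST = 1, /* memory -> device, final chunk of transfer */
    INPUT_MORE  = 2, /* device -> memory, more data follows       */
    INPUT_LAST  = 3, /* device -> memory, final chunk of transfer */
    STORE_QUAD  = 4,
    LOAD_QUAD   = 5,
    NOP         = 6,
    STOP        = 7, /* halt command processing                   */
};

/* One 16-byte channel-command descriptor (little-endian in memory). */
typedef struct dbdma_cmd {
    uint16_t req_count;   /* requested transfer length in bytes     */
    uint16_t command;     /* opcode plus wait/branch/interrupt bits */
    uint32_t phy_addr;    /* guest-physical buffer address          */
    uint32_t cmd_dep;     /* e.g. branch target for conditionals    */
    uint16_t xfer_status; /* written back when the command retires  */
    uint16_t res_count;   /* residual byte count written back       */
} dbdma_cmd;
```

The conditional-branch behaviour comes from the wait/branch bits in the command word together with cmd_dep, which is what makes statically pre-scanning the list for INPUT/OUTPUT commands tricky.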
>> It seems that the DMA API dma_bdrv_read()/dma_bdrv_write() can't handle unaligned transfers in this way, yet I think there is a better solution for this that doesn't mix DMA/non-DMA APIs in this manner. I'd like to try and come up with a better solution, but there seems to be a mix of synchronous/asynchronous/co-routine block APIs that could be used. So my question is: what do you think is the best way to approach solving the unaligned data access for MACIO using a DMA-friendly API?
>
> First, as I said above, I'd like to understand why you need to go with unaligned values into the DMA API. Real hardware also only works on a sector level.

The main culprit for these transfers is Darwin, which limits large transfers to 0xfffe bytes (see http://searchcode.com/codesearch/view/23337208 line 382). Hence most large disk transactions get broken down into irregularly-sized chunks, which highlights this issue.

> The block layer is really designed for working with whole sectors. The only functions dealing with byte-aligned requests are bdrv_pread/pwrite. They are synchronous (i.e. block the vcpu while running) and writes are slow (because they first read the whole sector, copy in the modified part, and write out the whole sector again), so you want to avoid them.

Yeah, I figured this wasn't the most efficient way of doing it. The reason for asking the question was that I'm still struggling with some kind of timing/threading issue with Alex's work here, and I'm wondering whether making the unaligned DMA requests more atomic by not mixing DMA/non-DMA/synchronous APIs will help solve the issue or not.

ATB,

Mark.
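As a quick illustration of that chunking (a sketch, not Darwin's actual code): capping a 256-sector transfer at 0xfffe bytes per chunk produces exactly the 0xfffe, 0xfffe, 0x4 pattern seen in the DBDMA program, since 256 * 512 = 0x20000 and 2 * 0xfffe leaves 4 bytes over.

```c
#include <assert.h>
#include <stddef.h>

/* Split a transfer of total_len bytes into chunks of at most cap bytes,
 * writing the chunk sizes into out[]. Returns the number of chunks.
 * Mirrors the effect of Darwin's 0xfffe-byte transfer limit. */
static size_t split_transfer(size_t total_len, size_t cap,
                             size_t *out, size_t max_chunks)
{
    size_t n = 0;
    while (total_len > 0 && n < max_chunks) {
        size_t chunk = total_len < cap ? total_len : cap;
        out[n++] = chunk;
        total_len -= chunk;
    }
    return n;
}
```

Every chunk boundary except the last falls mid-sector, which is exactly why whole-sector APIs struggle with these requests.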
Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?
On 17.07.2013, at 14:52, Mark Cave-Ayland wrote:
> On 17/07/13 09:16, Kevin Wolf wrote:
>
> Hi Kevin,
>
> Thanks for the reply - CC to qemu-devel as requested.
>
>>> I've been testing some of Alex Graf's patches for running Darwin under QEMU PPC and have been experiencing some timeout problems on block devices. My attention is drawn to this commit in particular: https://github.com/qemu/qemu/commit/80fc95d8bdaf3392106b131a97ca701fd374489a. The reason for this commit is that Darwin programs the DBDMA controller to transfer data from the ATA FIFO in chunks that aren't sector aligned, e.g. the ATA command requests 0x100 (256 sectors) but programs the DMA engine to transfer the data to memory as 3 chunks of 0xfffe, 0xfffe and 0x4 bytes.
>>
>> I'm not familiar with how DMA works for the macio IDE device. Do you have any pointers to specs or something?
>
> It works by setting up a DMA descriptor table (which is a list of commands) which is then executed, once the RUN status bit is set, until a STOP command is reached. Things are slightly more complicated in that commands can have conditional branches set on them.
>
>> The one important point I'm wondering about is why you call dma_bdrv_read() with a single 0xfffe QEMUSGList. Shouldn't it really be called with a QEMUSGList { 0xfffe, 0xfffe, 0x4 }, which should enable dma-helpers.c to do the right thing?
>
> Hmmm, I guess you could perhaps scan down the command list from the current position looking for all INPUT/OUTPUT commands until the next STOP command, and maybe build up a single QEMUSGList from that? I'm not sure exactly how robust that would be with the conditional branching though - Alex?

It'd at least be vastly different from how real hardware works, yes. We'd basically have to throw away the current interpretation code and instead emulate the device based on assumptions.

>> In any case, it would be good if you could prepare a (yet failing) qtest case that demonstrates how DMA works on this controller and what the problematic requests look like.
> I'll see what I can do, however I've not really looked at qtest before so it could take a little time. In the meantime, you can easily see these transfers by booting an old Darwin installation ISO.
>
>>> It seems that the DMA API dma_bdrv_read()/dma_bdrv_write() can't handle unaligned transfers in this way, yet I think there is a better solution for this that doesn't mix DMA/non-DMA APIs in this manner. I'd like to try and come up with a better solution, but there seems to be a mix of synchronous/asynchronous/co-routine block APIs that could be used. So my question is: what do you think is the best way to approach solving the unaligned data access for MACIO using a DMA-friendly API?
>>
>> First, as I said above, I'd like to understand why you need to go with unaligned values into the DMA API. Real hardware also only works on a sector level.
>
> The main culprit for these transfers is Darwin, which limits large transfers to 0xfffe bytes (see http://searchcode.com/codesearch/view/23337208 line 382). Hence most large disk transactions get broken down into irregularly-sized chunks, which highlights this issue.

The main issue is that we're dealing with 3 separate pieces of hardware here. There is the IDE controller, which works on sector level. And then there's the DMA controller, which fetches data from the IDE controller byte-wise (from what I understand). Both work independently, but we try to shoehorn both into the same callback.

>> The block layer is really designed for working with whole sectors. The only functions dealing with byte-aligned requests are bdrv_pread/pwrite. They are synchronous (i.e. block the vcpu while running) and writes are slow (because they first read the whole sector, copy in the modified part, and write out the whole sector again), so you want to avoid them.
>
> Yeah, I figured this wasn't the most efficient way of doing it.
> The reason for asking the question was that I'm still struggling with some kind of timing/threading issue with Alex's work here, and I'm wondering whether making the unaligned DMA requests more atomic by not mixing DMA/non-DMA/synchronous APIs will help solve the issue or not.

I don't think it'll make a difference. But I'd be more than happy to design this more properly too - the current code is vastly ugly.

Alex
Re: [Qemu-devel] Possibility of unaligned DMA accesses via the QEMU DMA API?
On 17/07/13 14:35, Kevin Wolf wrote:
> Okay, so I've had a quick look at that DMA controller, and it seems that for a complete emulation, there's no way around using a bounce buffer (and calling directly into the block layer instead of using dma-helpers.c) for the general case. You can have a fast path that is triggered if one or more directly following INPUT/OUTPUT commands cover the whole IDE command, and that creates a QEMUSGList as described above and uses dma-helpers.c to implement zero-copy requests. I suspect that your Darwin requests would actually fall into this category.
>
> Essentially I think Alex's patches are doing something similar, just not implementing the complete DMA controller feature set and with the regular slow path hacked as additional code into the fast path. So the code could be cleaner, it could use asynchronous block layer functions and handle errors, and it could be more complete, but at the end of the day you'd still have some fast-path zero-copy I/O and some calls into the block layer using bounce buffers.

I think the key concept to understand here is: at what point does the data hit the disk? From the comments in various parts of Darwin/Linux, it could be understood that the DMA transfers are between memory and the ATA drive *buffer*, so for writes especially there is no guarantee that they even hit the disk until some point in the future, unless of course the FLUSH flag is set in the control register.

So part of me thinks that maybe we are over-thinking this and we should just go with Kevin's original suggestion: what about if we start a new QEMUSGList for each IDE transfer, and just keep appending QEMUSGList entries until we find an OUTPUT_LAST/INPUT_LAST command?

Why is this valid? We can respond with a complete status for the intermediate INPUT_MORE/OUTPUT_MORE commands without touching the disk, because all that guarantees is that data has been passed between memory and the drive *buffer* - not that it has actually hit the disk.
And what is the point of having explicit _LAST commands if they aren't used to signify completion of the whole transfer between drive and memory?

ATB,

Mark.
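The appending scheme could be sketched roughly like this, with toy types standing in for QEMU's QEMUSGList and qemu_sglist_add() from dma-helpers.c (names and sizes here are illustrative; opcode values follow the DBDMA convention):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OUTPUT_MORE = 0, OUTPUT_LAST = 1, INPUT_MORE = 2, INPUT_LAST = 3 };

/* Toy scatter-gather list; sg_add() plays the role of
 * qemu_sglist_add() in dma-helpers.c. */
typedef struct { uint64_t base; uint32_t len; } SGEntry;
typedef struct { SGEntry entry[16]; size_t nsg; uint64_t size; } SGList;

static void sg_add(SGList *sg, uint64_t base, uint32_t len)
{
    sg->entry[sg->nsg].base = base;
    sg->entry[sg->nsg].len = len;
    sg->nsg++;
    sg->size += len;
}

/* Accumulate each INPUT/OUTPUT descriptor into the current sg-list.
 * Returns nonzero once a *_LAST command signals that the IDE transfer
 * is complete and the whole list can be handed to dma-helpers.c. */
static int dbdma_collect(SGList *sg, uint16_t opcode,
                         uint64_t addr, uint32_t len)
{
    sg_add(sg, addr, len);
    return opcode == INPUT_LAST || opcode == OUTPUT_LAST;
}
```

Fed the Darwin pattern (INPUT_MORE 0xfffe, INPUT_MORE 0xfffe, INPUT_LAST 0x4), this accumulates one 0x20000-byte list covering the whole 256-sector command, which dma-helpers.c can then service zero-copy.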