Re: BUG: SCSI aic7xxx driver and AMD IOMMU

2015-03-26 Thread Mark Hounschell

On 03/26/2015 11:52 AM, Joerg Roedel wrote:

Hi Mark,

On Thu, Mar 26, 2015 at 10:58:15AM -0400, Mark Hounschell wrote:

Sorry but CMA was still badly broken. I have a patch below that works.


In which way is it broken? What happens when you try to allocate memory
with dma_alloc_coherent?



I got Oops on both the alloc and free and I had to hit the reset button.


I've tested it with small (no CMA) and large (with CMA) dma's using
dma_alloc_coherent. The patch below is just the git diff from your
cloned tree piped to a file then copied into this email. If you require
an official patch I can send one. Just let me know.


The main differences I can spot are that you change the order (first
CMA, then buddy) and you manually align the input size. I can see the
reason for the later, but why does CMA need to be tried first?



I believe that is how dma_alloc_coherent works. If  CMA is present it 
uses it. Even for small allocations. If not present then it does the 
best it can. That patch was derived by looking at the Intel iommu driver 
works properly. Maybe you should take a look at that.



Also, in my opinion, this CMA thing is clearly a BUG not a feature
request. The AMD iommu clearly breaks CMA. I feel what ever fix
you are happy with should be back ported to stable.


It is not a BUG, the interface definition for dma_alloc_coherent does
not specify that it can allocate infinite amounts of memory. So this
patch does not qualify for stable.




I don't care what the the interface definition for dma_alloc_coherent 
says. If it does not describe it's use with CMA, it is outdated. Maybe 
the interface definition for dma_alloc_coherent was done before CMA 
was implemented.


Maybe you should be looking at CMA Documentation.  Unfortunately there 
does not yet seem to be any in the Documentation dir. Probably because 
CMA is supposed to be transparent to the DMA API. You know, one should 
not need to details about it, etc...


It is a BUG. To access CMA you use dma_alloc_coherent. You have 
completely highjacked that function.  You have broken CMA. PERIOD. That 
is a BUG.


Regards
Mark
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: SCSI aic7xxx driver and AMD IOMMU

2015-03-26 Thread Mark Hounschell
Hi Joerg,

On 03/26/2015 08:45 AM, Joerg Roedel wrote:
 On Wed, Mar 25, 2015 at 12:36:34PM -0400, Mark Hounschell wrote:
 BTW, so far the first 2 patches are working well. I was going to
 wait until the end of the day to report but so far I have been
 unable to produce the problems I was seeing. And I am in the middle
 of some driver work so lots of unloading/loading going on.
 
 Great, thanks. Please let me know when you have test results for the
 other patches too.
 
 
   Joerg
 
Sorry but CMA was still badly broken. I have a patch below that works.
I've tested it with small (no CMA) and large (with CMA) dma's using
dma_alloc_coherent. The patch below is just the git diff from your
cloned tree piped to a file then copied into this email. If you require
an official patch I can send one. Just let me know.

Also, in my opinion, this CMA thing is clearly a BUG not a feature
request. The AMD iommu clearly breaks CMA. I feel what ever fix
you are happy with should be back ported to stable.

Regards
Mark 

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index a0197d0..5ea4fed 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2917,28 +2917,36 @@ static void *alloc_coherent(struct device *dev, size_t 
size,
u64 dma_mask = dev-coherent_dma_mask;
struct protection_domain *domain;
unsigned long flags;
-   struct page *page;
+   struct page *page = 0;
+   int order;
+   unsigned int count;
+
+   size = PAGE_ALIGN(size);
+   order = get_order(size);
+   count = size  PAGE_SHIFT;
 
INC_STATS_COUNTER(cnt_alloc_coherent);
 
domain = get_domain(dev);
if (PTR_ERR(domain) == -EINVAL) {
-   page = alloc_pages(flag, get_order(size));
+   page = alloc_pages(flag, order);
*dma_addr = page_to_phys(page);
return page_address(page);
} else if (IS_ERR(domain))
return NULL;
 
-   dma_mask  = dev-coherent_dma_mask;
flag = ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32);
+   flag |= __GFP_ZERO;
 
-   page = alloc_pages(flag, get_order(size));
+   if (flag  __GFP_WAIT) {
+   page = dma_alloc_from_contiguous(dev, count, order);
+   if (page  page_to_phys(page) + size  dma_mask) {
+   dma_release_from_contiguous(dev, page, count);
+   page = NULL;
+   }
+   }
if (!page) {
-   if (!(flag  __GFP_WAIT))
-   return NULL;
-
-   page = dma_alloc_from_contiguous(dev, size  PAGE_SHIFT,
-get_order(size));
+   page = alloc_pages(flag, order);
if (!page)
return NULL;
}
@@ -2951,7 +2959,7 @@ static void *alloc_coherent(struct device *dev, size_t 
size,
*dma_addr = __map_single(dev, domain-priv, page_to_phys(page),
 size, DMA_BIDIRECTIONAL, true, dma_mask);
 
-   if (*dma_addr == DMA_ERROR_CODE) {
+   if (!dma_addr || (*dma_addr == DMA_ERROR_CODE)) {
spin_unlock_irqrestore(domain-lock, flags);
goto out_free;
}
@@ -2965,7 +2973,7 @@ static void *alloc_coherent(struct device *dev, size_t 
size,
 out_free:
 
if (!dma_release_from_contiguous(dev, page, size  PAGE_SHIFT))
-   __free_pages(page, get_order(size));
+   __free_pages(page, order);
 
return NULL;
 }
@@ -2979,6 +2987,11 @@ static void free_coherent(struct device *dev, size_t 
size,
 {
unsigned long flags;
struct protection_domain *domain;
+   int order;
+   struct page *page = virt_to_page(virt_addr);
+
+   size = PAGE_ALIGN(size);
+   order = get_order(size);
 
INC_STATS_COUNTER(cnt_free_coherent);
 
@@ -2995,7 +3008,9 @@ static void free_coherent(struct device *dev, size_t size,
spin_unlock_irqrestore(domain-lock, flags);
 
 free_mem:
-   free_pages((unsigned long)virt_addr, get_order(size));
+   if (!dma_release_from_contiguous(dev, page, size  PAGE_SHIFT))
+   __free_pages(page, order);
+
 }
 
 /*

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: SCSI aic7xxx driver and AMD IOMMU

2015-03-25 Thread Mark Hounschell

On 03/25/2015 11:13 AM, Joerg Roedel wrote:

Hi again,

On Wed, Mar 25, 2015 at 02:59:37PM +0100, Joerg Roedel wrote:

On Tue, Mar 24, 2015 at 07:57:49AM -0400, Mark Hounschell wrote:

I'll be happy to test it.


Okay, here you go:

git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git amd-iommu


I just added more patches to implement the use of the contiguous memory
allocator for the dma_alloc_coherent path. Since you asked for it, can
you test this as well?




Sure. No problem. If I update my tree from the above, will I get it.

BTW, so far the first 2 patches are working well. I was going to wait 
until the end of the day to report but so far I have been unable to 
produce the problems I was seeing. And I am in the middle of some driver 
work so lots of unloading/loading going on.


Regards
Mark

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: SCSI aic7xxx driver and AMD IOMMU

2015-03-24 Thread Mark Hounschell

On 03/23/2015 11:03 AM, Joerg Roedel wrote:

Hi Mark,

On Tue, Mar 03, 2015 at 02:36:19PM -0500, Mark Hounschell wrote:

It looks like this problem is NOT a bug with the SCSI aic7xxx driver
after all. I can duplicate this BUG very easily with other hardware.
Simply removing a driver module  (whether it its self, has actually
used any of the DMA API or not) that is sitting on the same pci bus
as a card that is actually using DMA will cause this. And that card
that is in use and using DMA will no longer function. It looks and
feels like unloading a module causes the IOMMU to improperly unmap
valid mappings.


You are right, I looked into the code and found the problem. I'll post a
fix for testing this week.



I'll be happy to test it.

Regards
Mark

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: SCSI aic7xxx driver and AMD IOMMU

2015-03-03 Thread Mark Hounschell

Hi Joerg,

It looks like this problem is NOT a bug with the SCSI aic7xxx driver 
after all. I can duplicate this BUG very easily with other hardware. 
Simply removing a driver module  (whether it its self, has actually used 
any of the DMA API or not) that is sitting on the same pci bus as a card 
that is actually using DMA will cause this. And that card that is in use 
and using DMA will no longer function. It looks and feels like 
unloading a module causes the IOMMU to improperly unmap valid mappings.


Regards
Mark

 Kernel 3.18.7 (x86_64) with an enabled IOMMU. This happens 
immediately after

unloading the aic7xxx driver. In fact this happens unloading any module 
associated
with any pci card on the same bus as the Adaptec AHA-2930CU card. If I blacklist
the aic7xxx driver the problem never shows.  If I disable the IOMMU, the problem
never shows.

This was brought up over at io...@lists.linux-foundation.org in this thread.
http://lists.linuxfoundation.org/pipermail/iommu/2015-February/012274.html
and they think it is not an iommu problem, but a possibly a problem with the
aic7xxx driver.  Complete dmesg attached.

[  106.355752] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 domain=0x 
address=0x00073980 flags=0x0070]
[  106.358732] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 domain=0x 
address=0x000739a0 flags=0x0070]
[  106.361870] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 domain=0x 
address=0x000739b0 flags=0x0070]
[  112.352950] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 domain=0x 
address=0x00074d00 flags=0x0070]
[  112.356057] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 domain=0x 
address=0x00074d20 flags=0x0070]
[  112.359142] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 domain=0x 
address=0x00074d30 flags=0x0070]
[  126.588896] AMD-Vi: Using protection domain 1 for device :10:05.0
[  126.591992] [ cut here ]
[  126.595107] WARNING: CPU: 4 PID: 0 at drivers/iommu/amd_iommu.c:2637 
dma_ops_domain_unmap.part.13+0x65/0x70()
[  126.598408] Modules linked in: iscsi_ibft iscsi_boot_sysfs af_packet kvm 
crc32_pclmul crc32c_intel aesni_intel snd_hda_codec_realtek 
snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec aes_x86_64 
drm snd_hwdep snd_pcm glue_helper lrw aic79xx gf128mul ablk_helper cryptd 
snd_timer 3c59x scsi_transport_spi snd r8169 xhci_pci xhci_hcd serio_raw pcspkr 
mii tpm_infineon tpm_tis fam15h_power k10temp i2c_piix4 processor thermal_sys 
shpchp tpm soundcore 8250_fintek dm_mod sr_mod cdrom ata_generic mxm_wmi 
pata_atiixp ohci_pci wmi button sg autofs4 [last unloaded: aic7xxx]
[  126.613992] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.18.7-lcrs #1
[  126.618146] Hardware name: Gigabyte Technology Co., Ltd. To be filled by 
O.E.M./990FXA-UD5, BIOS FB 01/23/2013
[  126.622456]  0009 88044fd03cb8 807bde4d 

[  126.626815]   88044fd03cf8 8024cf7c 
88044fd03cf8
[  126.631172]  88043cdbc380  0600 
000702c0
[  126.635504] Call Trace:
[  126.639703]  IRQ  [807bde4d] dump_stack+0x4e/0x71
[  126.644003]  [8024cf7c] warn_slowpath_common+0x7c/0xa0
[  126.648335]  [8024d045] warn_slowpath_null+0x15/0x20
[  126.652632]  [806a47e5] dma_ops_domain_unmap.part.13+0x65/0x70
[  126.656919]  [806a675b] __unmap_single.isra.16+0x9b/0x100
[  126.661242]  [806a7198] unmap_page+0x48/0x70
[  126.665572]  [a0157373] boomerang_rx+0x333/0x600 [3c59x]
[  126.669968]  [a015784a] boomerang_interrupt+0x16a/0x4f0 [3c59x]
[  126.674400]  [8029600e] handle_irq_event_percpu+0x3e/0x1e0
[  126.678860]  [802961ec] handle_irq_event+0x3c/0x60
[  126.683298]  [80298bfe] handle_fasteoi_irq+0x7e/0x130
[  126.687732]  [8020543d] handle_irq+0x1d/0x30
[  126.692164]  [80204cee] do_IRQ+0x4e/0xf0
[  126.696577]  [807c596a] common_interrupt+0x6a/0x6a
[  126.700873]  EOI  [802a6e23] ? hrtimer_start+0x13/0x20
[  126.705107]  [8020cad7] ? default_idle+0x17/0x100
[  126.709386]  [8020d4ca] arch_cpu_idle+0xa/0x10
[  126.713685]  [80281dda] cpu_startup_entry+0x34a/0x380
[  126.718010]  [80281a91] ? cpu_startup_entry+0x1/0x380
[  126.722361]  [80233857] start_secondary+0x157/0x180
[  126.726731] ---[ end trace ace20602a5302afb ]---
[  128.557546] [ cut here ]
[  128.560501] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 domain=0x0001 
address=0x000766c0 flags=0x0020]
[  128.560502] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 domain=0x0001 
address=0x00076700 flags=0x0020]
[  128.560503] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 domain=0x0001 
address=0x00076740 flags=0x0020]
[  128.560504] AMD-Vi: Event logged [IO_PAGE_FAULT device=0f:00.0 

Re: New 2.6.24.2 SG_IO SCSI problems

2008-02-23 Thread Mark Hounschell
Mike Christie wrote:
 Mark Hounschell wrote:
 Mark Hounschell wrote:
 Mike Christie wrote:
 Mike Christie wrote:
 Mark Hounschell wrote:
 I seem to have run into some sort of regression in the SG_IO
 interface of 2.6.24.2. I have an application that up until 2.6.24
 worked fine. The 2.6.23.16 kernel works fine.

 During reads I get these kernel messages. Writes and other functions
 _seem_ OK. Actually basic
 reads  are working. Its with large BC reads using an io_vec list that
 the problem shows up.

 Are you doing SG_IO to the sg device (/dev/sg*) or to the block device
 (/dev/sdX)?
 If you are doing SG_IO to the sg device, then I know of one regression
 (well not regression exactly, but I fixed a bug but the patch got
 partially overwritten by another patch and that caused a new bug). Both
 bugs are fixed in 2.6.25-rc2. Could you try that out if you are doing
 SG_IO to the sg device.

 Yes, I'm using /dev/sg*. And yes again I'll checkout 2.6.25-rc2 ASIC.

 Thanks
 Mark
 -
  
 2.6.25-rc2 does fix the problem I'm having. I don't suppose there is a
 patch
 lying around for 2.6.24.2??

 
 I attached a backport of the patch from Tony (added as cc) that is in
 2.6.25-rc2. Could you try it out against 2.6.24.2 just to make sure it
 was this patch, then we can send it to stable.
 

Sorry it took so long. This does fix my problem. I hope it's not to late
for 2.6.24.3

Regards
Mark
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New 2.6.24.2 SG_IO SCSI problems

2008-02-22 Thread Mark Hounschell
Mark Hounschell wrote:
 James Bottomley wrote:
 On Thu, 2008-02-21 at 10:15 -0500, Mark Hounschell wrote:
 I seem to have run into some sort of regression in the SG_IO interface of 
 2.6.24.2. 
 I have an application that up until 2.6.24 worked fine. The 2.6.23.16 
 kernel works fine.

 During reads I get these kernel messages. Writes and other functions _seem_ 
 OK. Actually basic
 reads  are working. Its with large BC reads using an io_vec list that the 
 problem shows up.

 Feb 21 09:27:51 harley kernel: (scsi1:A:2:0): data overrun detected in 
 Data-in phase.  Tag == 0x1.
 Feb 21 09:27:51 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length 
 = 256.  NumSGs = 1.
 Feb 21 09:27:51 harley kernel: sg[0] - Addr 0x06256100 : Length 256
 Help me a little here.  What was the io_vec and command you sent in to
 produce this?  The aic debugging information implies a single element sg
 list for a 256 byte read.

 James



 
 Well, I did no 256 byte xfer at all.  My failing io_vec list has 6 elements.
 The first 5 are for byte counts of 0xfffc and the last 0x6114. See below.
 
 This is some debug info from within my app of the commands leading up the 
 failure:
 If you need actual values of the io_vec lists I will need to add some 
 additional debug
 info into the app. I will do if needed.
 
 The disk BTW is formated at 768 byte sector size.
  
 The first read has a 2 element io_vec list and reports no error:
 ScsiDev_thread_7e00: Read CBD = 0x08 0x00 0x00 0x00 0x01 0x00
 ScsiDev_thread_7e00: Read1(1) bc = 0x0078 addr = 0x00 Skip 0
 ScsiDev_thread_7e00: Read(2) bc = 0x0288 addr = 0xb6cea368 Skip 1
 ScsiDev_thread_7e00: SRead  DC ops = 2 short_bc = 288
 
 The second read has a 4 element io_vec list and reports no error:
 ScsiDev_thread_7e00: Read CBD = 0x08 0x00 0x00 0x00 0x05 0x00
 ScsiDev_thread_7e00: Read1(1) bc = 0x0780 addr = 0x00 Skip 0
 ScsiDev_thread_7e00: Read1(2) bc = 0x0670 addr = 0x00 Skip 1
 ScsiDev_thread_7e00: Read1(3) bc = 0x003c addr = 0x000200 Skip 0
 ScsiDev_thread_7e00: SRead(4) bc = 0x022c addr = 0xb6cea368 Skip 1
 ScsiDev_thread_7e00: Read  DC ops = 4 short_bc = 22c
 
 There is a seek here that reports no error:
 ScsiDev_thread_7e00: Seek address 2752 BPS 768
 
 
 This read has a 6 element io_vec list and reports  the error to the app:
 ScsiDev_thread_7e00: ReadX CBD = 0x28 0x00 0x00 0x00 0x27 0x52 0x00 0x01 0xcb 
 0x00
 ScsiDev_thread_7e00: Read1(1) bc = 0xfffc addr = 0x000784 Skip 0
 ScsiDev_thread_7e00: Read1(2) bc = 0xfffc addr = 0x010780 Skip 0
 ScsiDev_thread_7e00: Read1(3) bc = 0xfffc addr = 0x02077c Skip 0
 ScsiDev_thread_7e00: Read1(4) bc = 0xfffc addr = 0x030778 Skip 0
 ScsiDev_thread_7e00: Read1(5) bc = 0xfffc addr = 0x040774 Skip 0
 ScsiDev_thread_7e00: Read1(6) bc = 0x6114 addr = 0x050770 Skip 0
 ScsiDev_thread_7e00: Read  DC ops = 6 short_bc = 0
 
 ScsiDev_thread_7e00: scsi = 0x0 msg 0x0 host = 0x0007 driver = 0x
 ScsiDev_thread_7e00: Read error: sns = 0x00 residual = 0x
 ScsiDev_thread_7e00: Posting IPL status 0x0090 0x000e for Suba  
 to loc 0x00
 ScsiDev_thread_7e00: Sleeping!!
 
 
 Below is the complete dump of kernel messages for the above. I assume they are
 a result of the last failing read but there is an awfull lot there just for 
 that 6 element
 io_vec list. Sorry to put all this in there but wanted to you to get the idea.
 
 
 Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): data overrun detected in 
 Data-in phase.  Tag == 0x1.
 Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length = 
 256.  NumSGs = 1.
 Feb 21 10:51:03 harley kernel: sg[0] - Addr 0x03252e100 : Length 256
.
. Snip
.
 Feb 21 10:51:49 harley kernel: (scsi1:A:2:0): data overrun detected in 
 Data-in phase.  Tag == 0x1.
 Feb 21 10:51:49 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length = 
 256.  NumSGs = 1.
 Feb 21 10:51:49 harley kernel: sg[0] - Addr 0x0f156100 : Length 256
 
 Again, sorry for all that.
 

FWIW, on a different machine doing the same thing I get only a single
kernel message and NOT the tons shown above.

Regards
Mark

-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New 2.6.24.2 SG_IO SCSI problems

2008-02-22 Thread Mark Hounschell
Mike Christie wrote:
 Mike Christie wrote:
 Mark Hounschell wrote:
 I seem to have run into some sort of regression in the SG_IO
 interface of 2.6.24.2. I have an application that up until 2.6.24
 worked fine. The 2.6.23.16 kernel works fine.

 During reads I get these kernel messages. Writes and other functions
 _seem_ OK. Actually basic
 reads  are working. Its with large BC reads using an io_vec list that
 the problem shows up.


 Are you doing SG_IO to the sg device (/dev/sg*) or to the block device
 (/dev/sdX)?
 
 If you are doing SG_IO to the sg device, then I know of one regression
 (well not regression exactly, but I fixed a bug but the patch got
 partially overwritten by another patch and that caused a new bug). Both
 bugs are fixed in 2.6.25-rc2. Could you try that out if you are doing
 SG_IO to the sg device.
 

Yes, I'm using /dev/sg*. And yes again I'll checkout 2.6.25-rc2 ASIC.

Thanks
Mark
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New 2.6.24.2 SG_IO SCSI problems

2008-02-22 Thread Mark Hounschell
Mark Hounschell wrote:
 Mike Christie wrote:
 Mike Christie wrote:
 Mark Hounschell wrote:
 I seem to have run into some sort of regression in the SG_IO
 interface of 2.6.24.2. I have an application that up until 2.6.24
 worked fine. The 2.6.23.16 kernel works fine.

 During reads I get these kernel messages. Writes and other functions
 _seem_ OK. Actually basic
 reads  are working. Its with large BC reads using an io_vec list that
 the problem shows up.

 Are you doing SG_IO to the sg device (/dev/sg*) or to the block device
 (/dev/sdX)?
 If you are doing SG_IO to the sg device, then I know of one regression
 (well not regression exactly, but I fixed a bug but the patch got
 partially overwritten by another patch and that caused a new bug). Both
 bugs are fixed in 2.6.25-rc2. Could you try that out if you are doing
 SG_IO to the sg device.

 
 Yes, I'm using /dev/sg*. And yes again I'll checkout 2.6.25-rc2 ASIC.
 
 Thanks
 Mark
 -
 
2.6.25-rc2 does fix the problem I'm having. I don't suppose there is a patch
lying around for 2.6.24.2??

Thanks
Mark

-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New 2.6.24.2 SG_IO SCSI problems

2008-02-21 Thread Mark Hounschell
James Bottomley wrote:
 On Thu, 2008-02-21 at 10:15 -0500, Mark Hounschell wrote:
 I seem to have run into some sort of regression in the SG_IO interface of 
 2.6.24.2. 
 I have an application that up until 2.6.24 worked fine. The 2.6.23.16 kernel 
 works fine.

 During reads I get these kernel messages. Writes and other functions _seem_ 
 OK. Actually basic
 reads  are working. Its with large BC reads using an io_vec list that the 
 problem shows up.

 Feb 21 09:27:51 harley kernel: (scsi1:A:2:0): data overrun detected in 
 Data-in phase.  Tag == 0x1.
 Feb 21 09:27:51 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length 
 = 256.  NumSGs = 1.
 Feb 21 09:27:51 harley kernel: sg[0] - Addr 0x06256100 : Length 256
 
 Help me a little here.  What was the io_vec and command you sent in to
 produce this?  The aic debugging information implies a single element sg
 list for a 256 byte read.
 
 James
 
 
 

Well, I did no 256 byte xfer at all.  My failing io_vec list has 6 elements.
The first 5 are for byte counts of 0xfffc and the last 0x6114. See below.

This is some debug info from within my app of the commands leading up the 
failure:
If you need actual values of the io_vec lists I will need to add some 
additional debug
info into the app. I will do if needed.

The disk BTW is formated at 768 byte sector size.
 
The first read has a 2 element io_vec list and reports no error:
ScsiDev_thread_7e00: Read CBD = 0x08 0x00 0x00 0x00 0x01 0x00
ScsiDev_thread_7e00: Read1(1) bc = 0x0078 addr = 0x00 Skip 0
ScsiDev_thread_7e00: Read(2) bc = 0x0288 addr = 0xb6cea368 Skip 1
ScsiDev_thread_7e00: SRead  DC ops = 2 short_bc = 288

The second read has a 4 element io_vec list and reports no error:
ScsiDev_thread_7e00: Read CBD = 0x08 0x00 0x00 0x00 0x05 0x00
ScsiDev_thread_7e00: Read1(1) bc = 0x0780 addr = 0x00 Skip 0
ScsiDev_thread_7e00: Read1(2) bc = 0x0670 addr = 0x00 Skip 1
ScsiDev_thread_7e00: Read1(3) bc = 0x003c addr = 0x000200 Skip 0
ScsiDev_thread_7e00: SRead(4) bc = 0x022c addr = 0xb6cea368 Skip 1
ScsiDev_thread_7e00: Read  DC ops = 4 short_bc = 22c

There is a seek here that reports no error:
ScsiDev_thread_7e00: Seek address 2752 BPS 768


This read has a 6 element io_vec list and reports  the error to the app:
ScsiDev_thread_7e00: ReadX CBD = 0x28 0x00 0x00 0x00 0x27 0x52 0x00 0x01 0xcb 
0x00
ScsiDev_thread_7e00: Read1(1) bc = 0xfffc addr = 0x000784 Skip 0
ScsiDev_thread_7e00: Read1(2) bc = 0xfffc addr = 0x010780 Skip 0
ScsiDev_thread_7e00: Read1(3) bc = 0xfffc addr = 0x02077c Skip 0
ScsiDev_thread_7e00: Read1(4) bc = 0xfffc addr = 0x030778 Skip 0
ScsiDev_thread_7e00: Read1(5) bc = 0xfffc addr = 0x040774 Skip 0
ScsiDev_thread_7e00: Read1(6) bc = 0x6114 addr = 0x050770 Skip 0
ScsiDev_thread_7e00: Read  DC ops = 6 short_bc = 0

ScsiDev_thread_7e00: scsi = 0x0 msg 0x0 host = 0x0007 driver = 0x
ScsiDev_thread_7e00: Read error: sns = 0x00 residual = 0x
ScsiDev_thread_7e00: Posting IPL status 0x0090 0x000e for Suba  to 
loc 0x00
ScsiDev_thread_7e00: Sleeping!!


Below is the complete dump of kernel messages for the above. I assume they are
a result of the last failing read but there is an awfull lot there just for 
that 6 element
io_vec list. Sorry to put all this in there but wanted to you to get the idea.


Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): data overrun detected in Data-in 
phase.  Tag == 0x1.
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length = 
256.  NumSGs = 1.
Feb 21 10:51:03 harley kernel: sg[0] - Addr 0x03252e100 : Length 256
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): data overrun detected in Data-in 
phase.  Tag == 0x1.
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length = 
256.  NumSGs = 1.
Feb 21 10:51:03 harley kernel: sg[0] - Addr 0x03252e100 : Length 256
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): data overrun detected in Data-in 
phase.  Tag == 0x1.
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length = 
256.  NumSGs = 1.
Feb 21 10:51:03 harley kernel: sg[0] - Addr 0x03252e100 : Length 256
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): data overrun detected in Data-in 
phase.  Tag == 0x1.
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length = 
256.  NumSGs = 1.
Feb 21 10:51:03 harley kernel: sg[0] - Addr 0x03252e100 : Length 256
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): data overrun detected in Data-in 
phase.  Tag == 0x1.
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length = 
256.  NumSGs = 1.
Feb 21 10:51:03 harley kernel: sg[0] - Addr 0x03252e100 : Length 256
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): data overrun detected in Data-in 
phase.  Tag == 0x1.
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): Have seen Data Phase.  Length = 
256.  NumSGs = 1.
Feb 21 10:51:03 harley kernel: sg[0] - Addr 0x03252e100 : Length 256
Feb 21 10:51:03 harley kernel: (scsi1:A:2:0): data overrun detected in Data-in 
phase.  Tag == 0x1

Re: [PATCH] scsi_error: Fix language abuse.

2008-02-08 Thread Mark Hounschell
linux-os (Dick Johnson) wrote:
 
 The correct word should be invalid, in spite of
 the fact that the SCSI committee used invalid syntax.
 
 Alan is right. There is nothing illegal in the kernel
 and if there is, it must be removed as soon as it
 is discovered!
 
 
il·le·gal  (-lgl)
adj.
1. Prohibited by law.
*2. Prohibited by official rules: an illegal pass in football.
*3. Unacceptable to or not performable by a computer: an illegal operation.

Mark
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html