Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-30 Thread Samuel Wolf
Hi Salvatore,

> Greg just queued it:
> https://lore.kernel.org/all/2023123013-dose-skirmish-27c2@gregkh/

Queued sounds good; I am happy there is a solution in sight.

The good thing about this bug is that we can use our "faulty" server
again; we thought the server had a hardware issue:
https://bugzilla.kernel.org/show_bug.cgi?id=217599#c55

Samuel



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-30 Thread Salvatore Bonaccorso
Hi,

On Sat, Dec 30, 2023 at 10:44:27AM +0100, Samuel Wolf wrote:
> Hi Salvatore,
> 
> > Thanks for your testing! Yes this is enough from your side, thanks a
> > lot for taking the time for the explicit test rounds!
> 
> no problem, thanks for all your work on the Debian project!

Very welcome as well. Thanks for this kind and very motivating
feedback.

> I hope you get feedback on your question here:
> https://lore.kernel.org/all/zy8oxge0qkyuk...@eldamar.lan/

Greg just queued it:
https://lore.kernel.org/all/2023123013-dose-skirmish-27c2@gregkh/

> I've been very careful since the last O_DIRECT issue..
> But on the other hand, hanging storage is also not without danger.

Yes, fully understandable. Even though the impact was quite targeted,
as Jan Kara explained in https://lwn.net/Articles/954841/, this was a
major hassle to handle right in the middle of the point release
push.

Regards,
Salvatore



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-30 Thread Samuel Wolf
Hi Salvatore,

> Thanks for your testing! Yes this is enough from your side, thanks a
> lot for taking the time for the explicit test rounds!

no problem, thanks for all your work on the Debian project!

I hope you get feedback on your question here:
https://lore.kernel.org/all/zy8oxge0qkyuk...@eldamar.lan/

I've been very careful since the last O_DIRECT issue..
But on the other hand, hanging storage is also not without danger.

Samuel



Processed: Re: Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-30 Thread Debian Bug Tracking System
Processing control commands:

> tags -1 + confirmed pending
Bug #1059624 [src:linux] linux-image-6.1.0-16-amd64: aacraid abort request / 
SCSI hang after upgrade from 11.8 -> 12.4
Added tag(s) confirmed and pending.
> tags -1 - moreinfo
Bug #1059624 [src:linux] linux-image-6.1.0-16-amd64: aacraid abort request / 
SCSI hang after upgrade from 11.8 -> 12.4
Removed tag(s) moreinfo.

-- 
1059624: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1059624
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-30 Thread Salvatore Bonaccorso
Control: tags -1 + confirmed pending
Control: tags -1 - moreinfo

On Sat, Dec 30, 2023 at 01:06:20AM +0100, Samuel Wolf wrote:
> Hi Salvatore,
> 
> >  So it would be welcome if you find time to make it possible to test it by 
> > saturday evening.
> 
> my test was quicker than expected since I found a way to reproduce the
> issue on my test server.
> 
> Behind the Adaptec 8805 is a 54TB RAID6 array, LUKS encrypted.
> As soon as I open and mount the LUKS drive with kernel 6.1.67-1, the
> controller hangs:
> 
> [  480.888273] aacraid: Host adapter abort request.
>    aacraid: Outstanding commands on (0,0,3,0):
> [  480.902784] aacraid: Host bus reset request. SCSI hang ?
> [  480.902933] aacraid :02:00.0: outstanding cmd: midlevel-0
> [  480.902935] aacraid :02:00.0: outstanding cmd: lowlevel-0
> [  480.902936] aacraid :02:00.0: outstanding cmd: error handler-0
> [  480.902936] aacraid :02:00.0: outstanding cmd: firmware-251
> [  480.902937] aacraid :02:00.0: outstanding cmd: kernel-0
> [  480.916921] aacraid :02:00.0: Controller reset type is 3
> [  480.917076] aacraid :02:00.0: Issuing IOP reset
> [  517.004437] aacraid :02:00.0: IOP reset succeeded
> [  517.029007] aacraid: Comm Interface type2 enabled
> [  529.479247] aacraid :02:00.0: Scheduling bus rescan
> [  539.678274] aacraid :02:00.0: DDR cache data recovered successfull
> 
> This is reproducible with every luksClose and luksOpen mount.
> 
> Now I boot into your test kernel 6.1.67-1a~test and try the same again:
> 
> [    9.610151] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready
> [   81.503552] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota mode: none.
> [  119.133460] EXT4-fs (dm-0): unmounting filesystem.
> [  138.547366] sd 0:0:3:0: [sda] Very big device. Trying to use READ CAPACITY(16).
> [  139.214205] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota mode: none.
> [  162.376044] EXT4-fs (dm-0): unmounting filesystem.
> [  182.222397] sd 0:0:3:0: [sda] Very big device. Trying to use READ CAPACITY(16).
> [  182.913977] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota mode: none.
> [  217.611072] EXT4-fs (dm-0): unmounting filesystem.
> [  230.778060] sd 0:0:3:0: [sda] Very big device. Trying to use READ CAPACITY(16).
> [  231.386349] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota mode: none.
> 
> No errors, and the LUKS device opens in ~1 second instead of the
> ~1 minute it took before.
> 
> Since I cannot review the patch/revert technically, is this enough
> testing for you?
> 
> Thanks for the test kernel.

Thanks for your testing! Yes, this is enough from your side, thanks a
lot for taking the time for the explicit test rounds!

Regards,
Salvatore



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-29 Thread Samuel Wolf
Hi Salvatore,

>  So it would be welcome if you find time to make it possible to test it by 
> saturday evening.

my test was quicker than expected since I found a way to reproduce the
issue on my test server.

Behind the Adaptec 8805 is a 54TB RAID6 array, LUKS encrypted.
As soon as I open and mount the LUKS drive with kernel 6.1.67-1, the
controller hangs:

[  480.888273] aacraid: Host adapter abort request.
   aacraid: Outstanding commands on (0,0,3,0):
[  480.902784] aacraid: Host bus reset request. SCSI hang ?
[  480.902933] aacraid :02:00.0: outstanding cmd: midlevel-0
[  480.902935] aacraid :02:00.0: outstanding cmd: lowlevel-0
[  480.902936] aacraid :02:00.0: outstanding cmd: error handler-0
[  480.902936] aacraid :02:00.0: outstanding cmd: firmware-251
[  480.902937] aacraid :02:00.0: outstanding cmd: kernel-0
[  480.916921] aacraid :02:00.0: Controller reset type is 3
[  480.917076] aacraid :02:00.0: Issuing IOP reset
[  517.004437] aacraid :02:00.0: IOP reset succeeded
[  517.029007] aacraid: Comm Interface type2 enabled
[  529.479247] aacraid :02:00.0: Scheduling bus rescan
[  539.678274] aacraid :02:00.0: DDR cache data recovered successfull

This is reproducible with every luksClose and luksOpen mount.
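
The length of the stall can be read off the timestamps in the excerpt above. A minimal sketch, assuming dmesg-style lines with a bracketed timestamp; the function names `parse_dmesg` and `reset_duration` are hypothetical, and the matched strings are taken only from the log lines quoted here, not from the aacraid driver source:

```python
import re

# Matches a dmesg-style line such as "[  480.902784] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def parse_dmesg(text):
    """Return (timestamp, message) pairs for lines that carry a timestamp."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    return entries


def reset_duration(text):
    """Seconds from the first abort request to 'IOP reset succeeded', or None."""
    start = end = None
    for ts, msg in parse_dmesg(text):
        if start is None and "Host adapter abort request" in msg:
            start = ts
        elif start is not None and "IOP reset succeeded" in msg:
            end = ts
            break
    if start is None or end is None:
        return None
    return end - start


# Four lines copied from the log above (illustrative subset).
SAMPLE = """\
[  480.888273] aacraid: Host adapter abort request.
[  480.902784] aacraid: Host bus reset request. SCSI hang ?
[  480.917076] aacraid :02:00.0: Issuing IOP reset
[  517.004437] aacraid :02:00.0: IOP reset succeeded
"""

if __name__ == "__main__":
    print(f"controller stalled for about {reset_duration(SAMPLE):.1f} s")
```

On the lines above this gives roughly 36 seconds between the abort request and the successful IOP reset; the bus rescan and cache recovery messages follow after that.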

Now I boot into your test kernel 6.1.67-1a~test and try the same again:

[    9.610151] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready
[   81.503552] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota mode: none.
[  119.133460] EXT4-fs (dm-0): unmounting filesystem.
[  138.547366] sd 0:0:3:0: [sda] Very big device. Trying to use READ CAPACITY(16).
[  139.214205] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota mode: none.
[  162.376044] EXT4-fs (dm-0): unmounting filesystem.
[  182.222397] sd 0:0:3:0: [sda] Very big device. Trying to use READ CAPACITY(16).
[  182.913977] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota mode: none.
[  217.611072] EXT4-fs (dm-0): unmounting filesystem.
[  230.778060] sd 0:0:3:0: [sda] Very big device. Trying to use READ CAPACITY(16).
[  231.386349] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota mode: none.

No errors, and the LUKS device opens in ~1 second instead of the
~1 minute it took before.
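
The timestamps in the test-kernel log support the "~1 second" figure. A rough sketch, assuming the gap between each "Very big device" line and the following EXT4 mount line is a usable proxy for the open-and-mount time (that pairing is my assumption, and `open_to_mount_gaps` is a hypothetical helper name):

```python
import re

# Matches a dmesg-style line such as "[  138.547366] message".
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def open_to_mount_gaps(text):
    """Seconds from each 'Very big device' line to the next EXT4 mount line."""
    gaps = []
    pending = None
    for line in text.splitlines():
        m = TS_RE.match(line.strip())
        if not m:
            continue  # lines without a timestamp (e.g. mail-wrapped text) are skipped
        ts, msg = float(m.group(1)), m.group(2)
        if "Very big device" in msg:
            pending = ts
        elif "mounted filesystem" in msg and pending is not None:
            gaps.append(round(ts - pending, 3))
            pending = None
    return gaps


# Lines copied from the test-kernel log above; the first entry is left
# wrapped the way a mail client might wrap it.
SAMPLE = """\
[  138.547366] sd 0:0:3:0: [sda] Very big device. Trying to use READ
CAPACITY(16).
[  139.214205] EXT4-fs (dm-0): mounted filesystem with ordered data mode.
[  162.376044] EXT4-fs (dm-0): unmounting filesystem.
[  182.222397] sd 0:0:3:0: [sda] Very big device. Trying to use READ CAPACITY(16).
[  182.913977] EXT4-fs (dm-0): mounted filesystem with ordered data mode.
[  217.611072] EXT4-fs (dm-0): unmounting filesystem.
[  230.778060] sd 0:0:3:0: [sda] Very big device. Trying to use READ CAPACITY(16).
[  231.386349] EXT4-fs (dm-0): mounted filesystem with ordered data mode.
"""

if __name__ == "__main__":
    print(open_to_mount_gaps(SAMPLE))
```

All three cycles come out at well under one second by this measure.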

Since I cannot review the patch/revert technically, is this enough
testing for you?

Thanks for the test kernel.

Samuel



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-29 Thread Salvatore Bonaccorso
Hi Samuel,

On Fri, Dec 29, 2023 at 08:30:55PM +0100, Samuel Wolf wrote:
> Hi Salvatore,
> 
> > if you are allowed to deploy unofficial builds: Would you be willing
> > to test that the packages in
> > https://people.debian.org/~carnil/tmp/linux/1059624/ fix the problem?
> > They contain that specific revert on top. I have signed the sha256sum
> > file with my key found in the Debian keyring.
> 
> thank you, I can test this kernel on a test server with the same
> Adaptec controller.
> 
> > I cannot really determine right now how many people are affected.
> 
> In theory, anyone who uses aacraid drivers/controllers.

Right, though apparently not every controller type is affected; at
least we know some of the types that are.

> > So in case you manage to test the above packages quite soon there is a
> > chance we can have it in the next upload. Otherwise in the next after.
> 
> I'll try to test this by saturday evening, is that too late?

Well, yes, the timing is a bit unfortunate. I was working today on
finalizing the uploads for bookworm (6.1.69-1) and bullseye
(5.10.205-1) when your bug report hit the bug tracker.

But at this point I guess I can delay the finalizing a bit further to
see your testing happen. So it would be welcome if you find time to
make it possible to test it by Saturday evening.

Regards,
Salvatore



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-29 Thread Samuel Wolf
Hi Salvatore,

> if you are allowed to deploy unofficial builds: Would you be willing
> to test that the packages in
> https://people.debian.org/~carnil/tmp/linux/1059624/ fix the problem?
> They contain that specific revert on top. I have signed the sha256sum
> file with my key found in the Debian keyring.

thank you, I can test this kernel on a test server with the same
Adaptec controller.

> I cannot really determine right now how many people are affected.

In theory, anyone who uses aacraid drivers/controllers.

> So in case you manage to test the above packages quite soon there is a
> chance we can have it in the next upload. Otherwise in the next after.

I'll try to test this by saturday evening, is that too late?

Samuel



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-29 Thread Salvatore Bonaccorso
Hi Samuel,

On Fri, Dec 29, 2023 at 03:52:58PM +0100, Samuel Wolf wrote:
> Hi Salvatore,
> 
> > And can you confirm that the patch revert fixes your issue?
> 
> unfortunately not, we downgraded to Debian 11 to get the server
> working again, and I do not have enough knowledge to build and test
> such a kernel.

if you are allowed to deploy unofficial builds: Would you be willing
to test that the packages in
https://people.debian.org/~carnil/tmp/linux/1059624/ fix the problem?
They contain that specific revert on top. I have signed the sha256sum
file with my key found in the Debian keyring.
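
Checking the downloaded packages against the checksum file could look like the following. This is only a sketch, assuming the file uses the usual `sha256sum` output format of `<hex digest>  <file name>` per line; the function names are hypothetical and the real file name of the checksum list is whatever the directory above provides. The signature on the checksum file itself has to be verified separately (for example with `gpg --verify`) before trusting its contents:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large .deb files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sums(sums_file):
    """Map each listed file name to True if its SHA-256 matches, else False.

    Files are looked up in the same directory as sums_file.
    """
    sums_path = Path(sums_file)
    results = {}
    for line in sums_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes binary-mode names with '*'
        target = sums_path.parent / name
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results


if __name__ == "__main__":
    import sys

    for name, ok in check_sums(sys.argv[1]).items():
        print(f"{name}: {'OK' if ok else 'FAILED'}")
```

Only install the packages if every entry reports OK and the signature check on the checksum file succeeded.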

> > The revert landed in mainline, but has not been queued for the stable 
> > series yet.
> 
> I guess it's better to wait for the stable series revert (from the
> Debian standpoint) and then backport it into Debian 12?

It's marked for stable and people on the bug did confirm that it fixes
the issue. I'm pretty confident that it will be picked up for the
stable queues as well, and guess the whole thing was just delayed.

I cannot really determine right now how many people are affected.
There will be a new bookworm upload quite soon; the one following it
will likely be in February for the point release.

So in case you manage to test the above packages quite soon there is a
chance we can have it in the next upload. Otherwise in the next after.

Regards,
Salvatore



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-29 Thread Samuel Wolf
Hi Salvatore,

> And can you confirm that the patch revert fixes your issue?

unfortunately not, we downgraded to Debian 11 to get the server
working again, and I do not have enough knowledge to build and test
such a kernel.

> The revert landed in mainline, but has not been queued for the stable series 
> yet.

I guess it's better to wait for the stable series revert (from the
Debian standpoint) and then backport it into Debian 12?

Thanks.

Samuel



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-29 Thread Salvatore Bonaccorso
Control: tags -1 + moreinfo

Hi Samuel,

On Fri, Dec 29, 2023 at 02:44:35PM +0100, Samuel Wolf wrote:
> Package: src:linux
> Version: 6.1.67-1
> Severity: normal
> X-Debbugs-Cc: samuelwol...@googlemail.com
> 
> Hello,
> 
> we upgraded our server with Adaptec ASR8805 raid controller from Debian 11.8 
> to 12.4.
> 
> After booting into the 6.1.x kernel the system boots and works without load,
> but as soon as the server has some load the system stops/freezes with abort
> request / SCSI hang.
> 
> This is a known issue, is it possible to backport this bugfix into Debian 12?
> https://bugzilla.kernel.org/show_bug.cgi?id=217599

And can you confirm that reverting the patch fixes your issue? That is
most likely so, looking through the upstream bug log, but an additional
confirmation is welcome. The revert landed in mainline, but has not
been queued for the stable series yet.

Regards,
Salvatore



Processed: Re: Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-29 Thread Debian Bug Tracking System
Processing control commands:

> tags -1 + moreinfo
Bug #1059624 [src:linux] linux-image-6.1.0-16-amd64: aacraid abort request / 
SCSI hang after upgrade from 11.8 -> 12.4
Added tag(s) moreinfo.

-- 
1059624: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1059624
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#1059624: linux-image-6.1.0-16-amd64: aacraid abort request / SCSI hang after upgrade from 11.8 -> 12.4

2023-12-29 Thread Samuel Wolf
Package: src:linux
Version: 6.1.67-1
Severity: normal
X-Debbugs-Cc: samuelwol...@googlemail.com

Hello,

we upgraded our server with Adaptec ASR8805 raid controller from Debian 11.8 to 
12.4.

After booting into the 6.1.x kernel the system boots and works without load,
but as soon as the server has some load the system stops/freezes with abort
request / SCSI hang.

This is a known issue, is it possible to backport this bugfix into Debian 12?
https://bugzilla.kernel.org/show_bug.cgi?id=217599


-- Package-specific info:
** Version:
Linux version 6.1.0-16-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 
12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP 
PREEMPT_DYNAMIC Debian 6.1.67-1 (2023-12-12)

** Command line:
BOOT_IMAGE=/boot/vmlinuz-6.1.0-16-amd64 
root=UUID=3f04dcfb-f323-4b62-aefa-219e71ea4f35 ro quiet splash

** Not tainted