from:"Nix"

Re: [PATCH 5/5] scsi: Set allocation length to 255 for ATA Information VPD page

2021-04-15 Thread Nix

On 14 Apr 2021, Maciej W. Rozycki stated:

> Set the allocation length to 255 for the ATA Information VPD page 
> requested in the WRITE SAME handler, so as not to limit information 
> examined by `scsi_get_vpd_page' in the supported vital product data 
> pages unnecessarily.
>
> Originally it was thought that Areca hardware may have issues with a 
> valid allocation length supplied for a VPD inquiry, however older SCSI 
> standard revisions[1] consider 255 the maximum length allowed and what 

Aaaah. That explains a lot! (Not that I can remember what SCSI standard
rev that Areca firmware claimed to implement. I know I never updated the
firmware, so it's going to be something no newer than mid-2009 and
probably quite a bit older.)

> Nix,
>
>  I can see you're still around.  Would you therefore please be so kind 
> as to verify this change with your Areca hardware if you still have it?

It's been up in the loft for years, but I'll get it out this weekend and
give it a spin :) this'll let me make sure the disks still spin as well,
which matters for an in-case-of-lightning-strike disaster-recovery
backup box.

(I just hope this kernel boots on it at all. It's about three years
since I retired it... let's see!)

>  It looks to me like you were thinking in the right direction with: 
> <https://lore.kernel.org/linux-scsi/87vc3nuipg@spindle.srvr.nix/>. 

It's the sort of mistake I could see myself making: an easy mistake to
make when so many things in C require buffer size - 1 or you get a
disastrous security hole...
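
To make the buffer-size point concrete, here is a minimal userspace sketch (not the kernel's code) of a six-byte INQUIRY CDB for the ATA Information VPD page. The helper name build_vpd_inquiry and the constant names are hypothetical. The only facts assumed are the INQUIRY opcode (0x12), the EVPD bit, page code 0x89, and an allocation length that fits in one byte and so tops out at 255.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INQUIRY_OPCODE      0x12  /* SCSI INQUIRY */
#define VPD_ATA_INFORMATION 0x89  /* ATA Information VPD page code */
#define LEGACY_MAX_ALLOC    255   /* largest value one length byte can hold */

/* Hypothetical helper: fill a 6-byte INQUIRY CDB for a VPD page,
 * clamping the allocation length to what a single byte can carry.
 * Returns the allocation length actually encoded. */
size_t build_vpd_inquiry(uint8_t cdb[6], uint8_t page, size_t buf_len)
{
    size_t alloc = buf_len > LEGACY_MAX_ALLOC ? LEGACY_MAX_ALLOC : buf_len;

    memset(cdb, 0, 6);
    cdb[0] = INQUIRY_OPCODE;
    cdb[1] = 0x01;            /* EVPD: ask for a vital product data page */
    cdb[2] = page;
    cdb[3] = 0;               /* high length byte, left zero in this sketch */
    cdb[4] = (uint8_t)alloc;  /* the full length, not buffer size - 1 */
    return alloc;
}
```

Unlike a C string buffer, nothing here needs a spare byte for a terminator, so the allocation length can be the whole usable size up to 255.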

-- 
NULL && (void)

Re: 4.20.7: pl2303 not working (post-4.19 regression) (limited info so far, not yet bisected)

2019-02-18 Thread Nix

On 18 Feb 2019, Johan Hovold stated:

> On Sun, Feb 17, 2019 at 07:13:52PM +0000, Nix wrote:
>> I'm still fairly sure this is a regression -- my machines are often up
>> for a lot longer than that and I've never seen this before I upgraded to
>> 4.20.x -- but I don't think I'm going to identify it by mindless
>> bisection. I might have to actually *think* about it.
>
> I doubt it's a regression in usb-serial as essentially nothing changed
> with respect to pl2303 or core since 4.19.

Yeah, I came to that conclusion as well.

> The -ENOSPC you're seeing is returned by the host controller to
> indicate:
>
>   This request would overcommit the usb bandwidth reserved for
>   periodic transfers (interrupt, isochronous).

Side note: probably not related to *this* -ENOSPC, which I've been
seeing for a few releases now and which appears to break Chromium's U2F
negotiation when the USB bus has sufficiently weird devices on it (like,
uh, my wireless mouse):

<https://bugs.chromium.org/p/chromium/issues/detail?id=932699>

(I say "probably not related" because it's much older and long predates
the pl2303 trouble.)
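
For readers matching the -28 in the pl2303 message against the -ENOSPC discussed here: kernel functions return negated errno values, and errno 28 is ENOSPC on Linux. A small userspace sketch of that mapping follows; the helper names are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: turn a negative kernel-style return code, as
 * printed in dmesg, into the matching userspace errno description. */
const char *describe_kernel_error(int code)
{
    return strerror(code < 0 ? -code : code);
}

/* True if the code is the -ENOSPC reported for the interrupt urb. */
int is_enospc(int code)
{
    return code == -ENOSPC;
}
```

The description string refers to disk space, which is why the message looks puzzling for a USB device; in this context the same code means the periodic bandwidth budget is exhausted.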

> but if you're saying you can reproduce this on "every box" it may not be
> related to any particular host-controller driver (or USB topology).

They are all xhci, at least. The pl2303 is USB 2. One machine, a
two-year-old Broadwell server, says:

xhci_hcd :00:14.0: xHCI Host Controller
xhci_hcd :00:14.0: new USB bus registered, assigned bus number 3
xhci_hcd :00:14.0: hcc params 0x200077c1 hci version 0x100 quirks 0x9810
xhci_hcd :00:14.0: cache line size of 32 is not supported
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 15 ports detected
xhci_hcd :00:14.0: xHCI Host Controller
xhci_hcd :00:14.0: new USB bus registered, assigned bus number 4
xhci_hcd :00:14.0: Host supports USB 3.0  SuperSpeed

00:14.0 USB controller: Intel Corporation C610/X99 series chipset USB xHCI Host Controller (rev 05) (prog-if 30 [XHCI])

The other, a 2012-era cheapish Ivy Bridge workstation:

xhci_hcd :03:00.0: xHCI Host Controller
xhci_hcd :03:00.0: new USB bus registered, assigned bus number 3
xhci_hcd :03:00.0: hcc params 0x014042cb hci version 0x96 quirks 0x0004
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
xhci_hcd :03:00.0: xHCI Host Controller
xhci_hcd :03:00.0: new USB bus registered, assigned bus number 4
xhci_hcd :03:00.0: Host supports USB 3.0  SuperSpeed
usb usb4: We don't know the algorithms for LPM for this host, disabling LPM.
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
xhci_hcd :04:00.0: xHCI Host Controller
xhci_hcd :04:00.0: new USB bus registered, assigned bus number 5
xhci_hcd :04:00.0: hcc params 0x0200f180 hci version 0x96 quirks 0x0008
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
xhci_hcd :04:00.0: xHCI Host Controller
xhci_hcd :04:00.0: new USB bus registered, assigned bus number 6
xhci_hcd :04:00.0: Host supports USB 3.0  SuperSpeed
usb usb6: We don't know the algorithms for LPM for this host, disabling LPM.

03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03) (prog-if 30 [XHCI])
04:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller (prog-if 30 [XHCI])

(I really don't know which of these is which. I suspect only one
actually has any visible ports on the outside of the case...)

So the quirks are all totally different, and the controllers are quite
different as well...

Re: 4.20.7: pl2303 not working (post-4.19 regression) (limited info so far, not yet bisected)

2019-02-17 Thread Nix

On 16 Feb 2019, Greg KH told this:

> On Sat, Feb 16, 2019 at 04:26:30PM +0000, Nix wrote:
>> So I just tried to connect up to my ancient Soekris firewall's serial
>> console to try to bisect a problem where it stopped booting in 4.20, and
>> found I couldn't.
>> 
>> minicom says:
>> 
>> minicom: cannot open /dev/ttyUSB0: Input/output error
>> 
>> and in the dmesg we see
>> 
>> [705576.028170] pl2303 ttyUSB0: failed to submit interrupt urb: -28
>>
>> Booting to 4.19, everything works fine. (A random GalliumOS Chromebook
[...]
>
> bisection would be great, thanks!

Rrrg. This is going to be harder than I thought. Rebooting, everything
works fine! So this is something that kicks in after something less than
eight days uptime, consistently on every box I own, but is fine on
reboot.

I'm still fairly sure this is a regression -- my machines are often up
for a lot longer than that and I've never seen this before I upgraded to
4.20.x -- but I don't think I'm going to identify it by mindless
bisection. I might have to actually *think* about it.

4.20.7: pl2303 not working (post-4.19 regression) (limited info so far, not yet bisected)

2019-02-16 Thread Nix

So I just tried to connect up to my ancient Soekris firewall's serial
console to try to bisect a problem where it stopped booting in 4.20, and
found I couldn't.

minicom says:

minicom: cannot open /dev/ttyUSB0: Input/output error

and in the dmesg we see

[705576.028170] pl2303 ttyUSB0: failed to submit interrupt urb: -28

Booting to 4.19, everything works fine. (A random GalliumOS Chromebook
running 4.9.4 works fine too, not that that confirmation is terribly
useful.)

This is an extremely preliminary report in case it's instantly obvious
what's going on: I'll do enough investigation to produce an actually
useful bug report, including bisecting this, after I've bisected the
*other* non-booting bug, but that might not be until next weekend. (All
this for a firewall I was trying to decommission! bah :) )

Re: bcache on XFS: metadata I/O (dirent I/O?) not getting cached at all?

2019-02-07 Thread Nix

On 7 Feb 2019, Coly Li uttered the following:
> On 2019/2/7 6:11 上午, Nix wrote:
>> As it is, this seems to render bcache more or less useless with XFS,
>> since bcache's primary raison d'etre is precisely to cache seeky stuff
>> like metadata. :(
>
> Hi Nix,
>
> Could you please to try whether the attached patch makes things better ?

Looks good! Before huge tree cp -al:

loom:~# bcache-stats 
stats_total/bypassed: 1.0G
stats_total/cache_bypass_hits: 16
stats_total/cache_bypass_misses: 26436
stats_total/cache_hit_ratio: 46
stats_total/cache_hits: 24349
stats_total/cache_miss_collisions: 8
stats_total/cache_misses: 27898
stats_total/cache_readaheads: 0

After:

stats_total/bypassed: 1.1G
stats_total/cache_bypass_hits: 16
stats_total/cache_bypass_misses: 27176
stats_total/cache_hit_ratio: 43
stats_total/cache_hits: 24443
stats_total/cache_miss_collisions: 9
stats_total/cache_misses: 32152   <
stats_total/cache_readaheads: 0

So loads of new misses. (A bunch of bypassed misses too. Not sure where
those came from, maybe some larger sequential reads somewhere, but a lot
is getting cached now, and every bit of metadata that gets cached means
things get a bit faster.)
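
The cache_hit_ratio figures in the two listings are consistent with a rounded-down hits * 100 / (hits + misses). That formula is an assumption inferred from the numbers shown, not something checked against the bcache source, and the function name is hypothetical. A minimal sketch:

```c
#include <stdint.h>

/* Integer hit ratio in percent, assuming (not verified against the
 * bcache source) that it is hits * 100 / (hits + misses), rounded down. */
unsigned hit_ratio_percent(uint64_t hits, uint64_t misses)
{
    uint64_t total = hits + misses;

    return total ? (unsigned)(hits * 100 / total) : 0;
}
```

With the counters above, 24349 hits and 27898 misses give 46, and 24443 hits and 32152 misses give 43. The ratio drops because a cold cache has to take the misses before it can serve hits, so a falling ratio right after the patch is the expected sign that metadata is now being inserted.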

btw I have ported ewheeler's ioprio-based cache hinting patch to 4.20;
I/O below the ioprio threshold bypasses everything, even metadata and
REQ_PRIO stuff. (It was trivial, but I was able to spot and fix a tiny
bypass accounting bug in the patch in the process): see
http://www.esperi.org.uk/~nix/bundles/bcache-ioprio.bundle. (I figured
you didn't want almost exactly the same patch series as before posted to
the list, but I can do that if you prefer.)

Put this all together and it seems to work very well: my test massive
compile triggered 500MiB of metadata writes at the start and then the
actual compile (being entirely sequential reads) hardly wrote anything
out and was almost entirely bypassed: meanwhile a huge git push I ran at
idle priority didn't pollute the cache at all. Excellent!

(I'm also keeping write volumes down by storing transient things like
objdirs that just sit in the page cache and then get deleted on a
separate non-bcached, non-journalled ext4 fs at the start of the
spinning rust disk, with subdirs of this fs bind-mounted into various
places as needed. I should make the scripts that do that public because
they seem likely to be useful to bcache users...)

Semi-unrelated side note: after my most recent reboot, which involved a
bcache journal replay even though my shutdown was clean, the stats_total
reset; the cache device's bcache/written and
bcache/set/cache_available_percent also flipped to 0 and 100%. I
suspect this is merely a stats bug of some sort, because the boot was
notably faster than before and cache_hits was about 6000 by the time it
was done. bcache/priority_stats *does* say that the cache is "only" 98%
unused, like it did before. Maybe cache_available_percent doesn't mean
what I thought it did.

-- 
NULL && (void)

Re: bcache on XFS: metadata I/O (dirent I/O?) not getting cached at all?

2019-02-07 Thread Nix

On 7 Feb 2019, Coly Li stated:
> On 2019/2/7 10:26 上午, Dave Chinner wrote:
>> So, yeah, that needs to be reverted if you want bcache to function
>> properly for metadata caching.
>
> Sure, I will fix this, once I make it clear to me.

I'll give it a test :)

The meaning of these flags was somewhat opaque to me, too (mostly due to
novelty: I've never really looked at anything in the block layer
before).

-- 
NULL && (void)

bcache on XFS: metadata I/O (dirent I/O?) not getting cached at all?

2019-02-06 Thread Nix

So I just upgraded to 4.20 and revived my long-turned-off bcache now
that the metadata corruption leading to mount failure on dirty close may
have been identified (applying Tang Junhui's patch to do so)... and I
spotted something a bit disturbing. It appears that XFS directory and
metadata I/O is going more or less entirely uncached.

Here's some bcache stats before and after a git status of a *huge*
uncached tree (Chromium) on my no-writeback readaround cache. It takes
many minutes and pounds the disk with massively seeky metadata I/O in
the process:

Before:

stats_total/bypassed: 48.3G
stats_total/cache_bypass_hits: 7942
stats_total/cache_bypass_misses: 861045
stats_total/cache_hit_ratio: 3
stats_total/cache_hits: 16286
stats_total/cache_miss_collisions: 25
stats_total/cache_misses: 411575
stats_total/cache_readaheads: 0

After:
stats_total/bypassed: 49.3G
stats_total/cache_bypass_hits: 7942
stats_total/cache_bypass_misses: 1154887
stats_total/cache_hit_ratio: 3
stats_total/cache_hits: 16291
stats_total/cache_miss_collisions: 25
stats_total/cache_misses: 411625
stats_total/cache_readaheads: 0
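
Subtracting the two snapshots makes the imbalance concrete. The sketch below is illustrative only: the struct holds a subset of the stats_total counters, the function names are hypothetical, and treating hits, misses and bypass misses as one pool of read requests is a simplifying assumption.

```c
#include <stdint.h>

struct bcache_stats {            /* subset of the stats_total counters */
    uint64_t cache_hits;
    uint64_t cache_misses;
    uint64_t cache_bypass_misses;
};

/* Counter growth between two snapshots of the same device. */
struct bcache_stats stats_delta(struct bcache_stats before,
                                struct bcache_stats after)
{
    struct bcache_stats d = {
        after.cache_hits - before.cache_hits,
        after.cache_misses - before.cache_misses,
        after.cache_bypass_misses - before.cache_bypass_misses,
    };
    return d;
}

/* Share of the new requests that bypassed the cache, rounded down. */
unsigned bypassed_percent(struct bcache_stats d)
{
    uint64_t total = d.cache_hits + d.cache_misses + d.cache_bypass_misses;

    return total ? (unsigned)(d.cache_bypass_misses * 100 / total) : 0;
}
```

For the listings above the delta is 5 hits, 50 cached misses and 293842 bypassed misses, so under this assumption over 99% of the new requests never touched the cache.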

Huge increase in bypassed reads, essentially no new cached reads. This
is... basically the optimum case for bcache, and it's not caching it!

From my reading of xfs_dir2_leaf_readbuf(), it looks like essentially
all directory reads in XFS appear to bcache as a single non-readahead
followed by a pile of readahead I/O: bcache bypasses readahead bios, so
all directory reads (or perhaps all directory reads larger than a single
block) are going to be bypassed out of hand.
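
The bypass decision described above, and one way it could be refined, can be sketched in userspace C. This is not bcache's code: the flag names merely echo the kernel's request flags, their values are made up for the example, and both function names are hypothetical.

```c
#include <stdbool.h>

/* Illustrative flag bits only: the names mirror kernel request flags,
 * but the values are invented for this sketch. */
enum {
    SK_REQ_RAHEAD = 1u << 0,
    SK_REQ_META   = 1u << 1,
    SK_REQ_PRIO   = 1u << 2,
};

/* Behaviour described in the report: every readahead bio is bypassed. */
bool bypass_all_readahead(unsigned flags)
{
    return (flags & SK_REQ_RAHEAD) != 0;
}

/* One possible refinement: keep bypassing speculative readahead, but
 * let readahead flagged as metadata or high priority be cached. */
bool bypass_unless_metadata(unsigned flags)
{
    if (!(flags & SK_REQ_RAHEAD))
        return false;
    return !(flags & (SK_REQ_META | SK_REQ_PRIO));
}
```

The refinement only helps if the filesystem actually marks its directory readahead with such flags, which is the open question raised in the next paragraph.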

This seems... suboptimal, but so does filling up the cache with
read-ahead blocks (particularly for non-metadata) that are never used.
Anyone got any ideas, 'cos I'm currently at a loss: XFS doesn't appear
to let us distinguish between "read-ahead just in case but almost
certain to be accessed" (like directory blocks) and "read ahead on the
offchance because someone did a single-block file read and what the hell
let's suck in a bunch more".

As it is, this seems to render bcache more or less useless with XFS,
since bcache's primary raison d'etre is precisely to cache seeky stuff
like metadata. :(

Re: [4.1.x -- 4.6.x and probably HEAD] Reproducible unprivileged panic/TLB BUG on sparc via a stack-protected rt_sigaction() ka_restorer, courtesy of the glibc testsuite

2016-05-30 Thread Nix

On 29 May 2016, David Miller spake thusly:

> BTW Nick, in thinking through all of this, I want to strongly encourage
> you to disable stack protector for all sigreturn stubs in the GLIBC tree.

I completely concur, and have already written (but not committed) a
patch to do this: I'll augment the existing sparc-only patch into a
sigreturn-stubs patch. I *think* I spotted all the stubs. (Many of them
are in assembler, but not all.)

(If there's anything else which involves calling functions with a
precisely-aligned stack and an expectation of no stack pointer movement
in the prologue or epilogue, I'd be interested to know about it, since
that'll need inhibit_stack_protector'ing too.)

-- 
NULL && (void)

Re: 4.4.1 regression from 4.1.x: Soekris net5501 crash in IRQ after mfgpt timer initialization

2016-02-02 Thread Nix

On 2 Feb 2016, Thomas Gleixner verbalised:

> On Tue, 2 Feb 2016, Nix wrote:
>> The fairly trivial code motion below also seems to work, and may be more
>> like an actual fix, though I'm a bit horrified that it's this simple. I
>> may well have moved too much and unknowingly violated some invariant.
>
> I was lazy and did not do this, because it wreckages the error pathes. So I
> went for the workaround in the hope that the authors of that stuff will take
> care :)

Oh true.

As far as I can tell, getting this right requires a function that
reverses the effect of clockevents_config_and_register(), which does not
appear to exist yet :( everything else appears more or less
reversible...
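
Why a missing "unregister" wrecks the error paths can be shown with a toy model of an init sequence that unwinds in reverse on failure. Everything here is hypothetical (step names, return codes, the idea that exactly one step is irreversible); it models the shape of the problem, not the driver.

```c
#include <stdbool.h>

enum step { STEP_TIMER, STEP_CLOCKEVENT, STEP_IRQ, STEP_COUNT };

struct init_state {
    bool done[STEP_COUNT];
};

/* Run the steps in order; fail_at names the step that fails, or
 * STEP_COUNT for no failure. On failure, completed steps are undone in
 * reverse order. Returns 0 on success, -1 on a clean failure, and -2
 * if unwinding had to skip the clockevent step, which this sketch
 * treats as impossible to undo. */
int init_sequence(struct init_state *s, enum step fail_at)
{
    for (int i = 0; i < STEP_COUNT; i++) {
        if ((enum step)i == fail_at) {
            int rc = -1;

            for (int j = i - 1; j >= 0; j--) {
                if (j == STEP_CLOCKEVENT) {
                    rc = -2;        /* no reverse operation: left in place */
                    continue;
                }
                s->done[j] = false; /* ordinary, reversible step */
            }
            return rc;
        }
        s->done[i] = true;
    }
    return 0;
}
```

With the clockevent registered before the IRQ is requested, a failure in the IRQ step leaves the clockevent registered with nothing behind it, which is the case the -2 return stands for.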

-- 
NULL && (void)

4.4.1 regression from 4.1.x: Soekris net5501 crash in IRQ after mfgpt timer initialization

2016-02-02 Thread Nix

[Cc:ed Thomas on the vague hope that maybe this is something to do with
 the IRQ subsystem in general, though I doubt it, since only the one
 machine is crashing for me: it's probably the CS5531's interactions
 with said subsystem at fault.]

So I just upgraded from 4.1 to 4.4.1, secure in the knowledge that some
disaster would overtake me. And, indeed, my Soekris net5501 firewall no
longer boots, with this oops (full log and .config below):

[1.589543] cs5535-clockevt: Registering MFGPT timer as a clock event, using IRQ 7
[1.604921] BUG: unable to handle kernel NULL pointer dereference at   (null)
[1.604929] IP: [<  (null)>]   (null)
[1.604936] *pde = 
[1.604945] Oops:  [#1]
[1.604960] CPU: 0 PID: 1 Comm: swapper Not tainted 4.4.1+ #1
[1.604969] task: df44c000 ti: df436000 task.ti: df436000
[1.604978] EIP: 0060:[<>] EFLAGS: 00010093 CPU: 0
[1.604985] EIP is at 0x0
[1.604993] EAX: c048d760 EBX: df406540 ECX:  EDX: 620c
[1.605002] ESI: 0007 EDI: c048d720 EBP: df409fbc ESP: df409fb8
[1.605011]  DS: 007b ES: 007b FS:  GS:  SS: 0068
[1.605019] CR0: 8005003b CR2:  CR3: 004d5000 CR4: 0090
[1.605022] Stack:
[1.605044]  c02cf141 df409fdc c013f40d  0001  df406540 df406550
[1.605066]  c0140f94 df409fe8 c013f4d3 df406540 df409ff8 c0141005 df40bf70 df406540
[1.605074]  df40bf78 c0102c9b
[1.605077] Call Trace:
[1.605101]  [] ? mfgpt_tick+0x6e/0x77
[1.605127]  [] handle_irq_event_percpu+0x26/0xcf
[1.605141]  [] ? handle_simple_irq+0x46/0x46
[1.605157]  [] handle_irq_event+0x1d/0x29
[1.605170]  [] handle_level_irq+0x71/0x95
[1.605187]  [] handle_irq+0x44/0x56
[1.605204]  
[1.605206]  [] do_IRQ+0x32/0x99
[1.605228]  [] common_interrupt+0x29/0x30
[1.605246]  [] ? __setup_irq+0x1ff/0x3b6
[1.605263]  [] ? __do_softirq+0x53/0x154
[1.605278]  [] ? _local_bh_enable+0x3a/0x3a
[1.605292]  [] do_softirq_own_stack+0x1c/0x22
[1.605309]  
[1.605311]  [] irq_exit+0x31/0x67
[1.605323]  [] do_IRQ+0x86/0x99
[1.605339]  [] common_interrupt+0x29/0x30
[1.605362]  [] ? try_to_grab_pending+0x38/0xec
[1.605378]  [] ? console_unlock+0x211/0x39b
[1.605394]  [] vprintk_emit+0x2b4/0x2be
[1.605410]  [] ? init_hrt_clocksource+0xa4/0xa4
[1.605426]  [] vprintk_default+0x12/0x14
[1.605446]  [] printk+0x11/0x13
[1.605462]  [] cs5535_mfgpt_init+0xce/0xf1
[1.605480]  [] do_one_initcall+0xce/0x13e
[1.605496]  [] ? parse_args+0x1bd/0x281
[1.605514]  [] ? kernel_init_freeable+0xb9/0x156
[1.605531]  [] kernel_init_freeable+0xd9/0x156
[1.60]  [] kernel_init+0x8/0xb5
[1.605571]  [] ret_from_kernel_thread+0x20/0x34
[1.605585]  [] ? rest_init+0x59/0x59
[1.605598] Code:  Bad EIP value.
[1.605610] EIP: [<>] 0x0 SS:ESP 0068:df409fb8
[1.605615] CR2: 
[1.605647] ---[ end trace 24cdf30ee51cea99 ]---
[1.605654] Kernel panic - not syncing: Fatal exception in interrupt
[1.605659] Kernel Offset: disabled
[1.614882] Rebooting in 5 seconds..

As far as I can tell, it's not actually crashing *in*
handle_irq_event_percpu(): at least, piling printk()s in there shows
nothing, and piling printk()s in mfgpt_tick() shows successful
completion: that's clearly residue from a previous frame. I'm not sure
*where* it's crashing at this point. In particular, cs5535_mfgpt_init()
only calls platform_driver_register(): it never calls printk(), but both
are shown as non-?. So I am somewhat at a loss with this one so far: the
stacktrace looks more or less impossible, even quite a long way up. I
guess I'll be reduced to bisection :(

I'm using a GCC 4.9.4 prerelease (dated 20151031) which has built
multiple (4.1) kernels perfectly happily before now and has built 4.4.1
on two other machines without visible problem.

Full boot log:

POST: 012345689bcefghips1234ajklnopqr,,,tvwxy








comBIOS ver. 1.33  20070103  Copyright (C) 2000-2007 Soekris Engineering.

net5501

 CPU Geode LX 500 Mhz  Mbyte 
Memory0512

Pri Mas  SanDisk SDCFH2-002G LBA Xlt 992-64-63  2001 Mbyte

Slot   Vend Dev  ClassRev Cmd  Stat CL LT HT  Base1Base2   Int 
---
0:01:2 1022 2082 1010 0006 0220 08 00 00 A000  10
0:06:0 1106 3053 0296 0117 0210 08 40 00 E101 A0004000 11
0:07:0 1106 3053 0296 0117 0210 08 40 00 E201 A0004100 05
0:08:0 1106 3053 0296 0117 0210 08 40 00 E301 A0004200 09
0:09:0 1106 3053 0296 0117 0210 08 40 00 E401 A0004300 12
0:14:0 104C AC23 06040002 0107 0210 08 40 01   
0:20:0 1022 2090 06010003 0009 02A0 08 40 80 6001 6101 
0:20:2 1022 209A 01018001 0005 02A0 08 00 00   
0:21:0 1022 2094 0C031002 0006 0230 08 00 80 A0005000  15
0:21:1 1022 2095 0C032002 0006 0230

Re: 4.4.1 regression from 4.1.x: Soekris net5501 crash in IRQ after mfgpt timer initialization

2016-02-02 Thread Nix

On 2 Feb 2016, Thomas Gleixner said:

> On Tue, 2 Feb 2016, Nix wrote:
>
>> [Cc:ed Thomas on the vague hope that maybe this is osmething to do with
>>  the IRQ subsystem in general, though I doubt it, since only the one
>>  machine is crashing for me: it's probably the CS5531's interactions
>>  with said subsystem at fault.]
>
> Kinda. That driver does the following:
>
>setup the irq in CS5531
>
>request the interrupt to install the handler
>
>register the clockevents device

It seems like it should do those in the opposite order, really, or at
the very least do the IRQ setup last!

> So the interrupt hits before the clockevent device is registered and the event
> handler is installed. So mfgpt_tick() will happily call a null pointer.
>
> The patch below should fix^Wwork around the issue.

The fairly trivial code motion below also seems to work, and may be more
like an actual fix, though I'm a bit horrified that it's this simple. I
may well have moved too much and unknowingly violated some invariant.

(Note: the actual code motion was of course to move the IRQ registration
down, but git chose to depict it as the opposite, somewhat unclearly.)
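
The race itself is easy to model outside the kernel. In the sketch below, which uses hypothetical names throughout and is not the driver's code, an "interrupt" arriving before registration finds no handler installed. The guard shows the kind of check a workaround would add, while calling register_clock_event() first corresponds to the reordering.

```c
#include <stddef.h>

struct clock_event {
    void (*event_handler)(struct clock_event *);
    int ticks;
};

static void count_tick(struct clock_event *ce)
{
    ce->ticks++;
}

/* Stand-in for registration: this is what installs the handler. */
void register_clock_event(struct clock_event *ce)
{
    ce->event_handler = count_tick;
}

/* Stand-in for the interrupt handler. Returns 0 if the tick was
 * delivered, or -1 if it arrived before a handler was installed, the
 * case in which an unguarded call would jump through a null pointer. */
int timer_interrupt(struct clock_event *ce)
{
    if (ce->event_handler == NULL)
        return -1;
    ce->event_handler(ce);
    return 0;
}
```

A guard of this kind hides the early interrupt, whereas registering first means the early interrupt has somewhere valid to go.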

Done under my work address because all this firewall rebooting is
stopping me from getting work done:

From 4ba04a48573c8a2136533556a3fbef7de288913f Mon Sep 17 00:00:00 2001
From: Nick Alcock 
Date: Tue, 2 Feb 2016 14:57:56 +
Subject: [PATCH] cs5535-clockevt: set up the MFGPT only after registering the
 IRQ

This prevents a race whereby the IRQ arrives before the clockevent
handler is installed.

Signed-off-by: Nick Alcock 
Inspired-by: Thomas Gleixner 
---
 drivers/clocksource/cs5535-clockevt.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/clocksource/cs5535-clockevt.c 
b/drivers/clocksource/cs5535-clockevt.c
index 9a7e37c..5737e17 100644
--- a/drivers/clocksource/cs5535-clockevt.c
+++ b/drivers/clocksource/cs5535-clockevt.c
@@ -152,6 +152,18 @@ static int __init cs5535_mfgpt_init(void)
 	}
 	cs5535_event_clock = timer;
 
+	/* Set the clock scale and enable the event mode for CMP2 */
+	val = MFGPT_SCALE | (3 << 8);
+
+	cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
+
+	/* Set up the clock event */
+	printk(KERN_INFO DRV_NAME
+		": Registering MFGPT timer as a clock event, using IRQ %d\n",
+		timer_irq);
+	clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
+					0xF, 0xFFFE);
+
 	/* Set up the IRQ on the MFGPT side */
 	if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
 		printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
@@ -166,18 +178,6 @@ static int __init cs5535_mfgpt_init(void)
 		goto err_irq;
 	}
 
-	/* Set the clock scale and enable the event mode for CMP2 */
-	val = MFGPT_SCALE | (3 << 8);
-
-	cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
-
-	/* Set up the clock event */
-	printk(KERN_INFO DRV_NAME
-		": Registering MFGPT timer as a clock event, using IRQ %d\n",
-		timer_irq);
-	clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
-					0xF, 0xFFFE);
-
 	return 0;
 
 err_irq:
-- 
2.7.0.198.g6dd47b6

4.4.1 regression from 4.1.x: Soekris net5501 crash in IRQ after mfgpt timer initialization

2016-02-02 Thread Nix

[Cc:ed Thomas on the vague hope that maybe this is osmething to do with
 the IRQ subsystem in general, though I doubt it, since only the one
 machine is crashing for me: it's probably the CS5531's interactions
 with said subsystem at fault.]

So I just upgraded from 4.1 to 4.4.1, secure in the knowledge that some
disaster would overtake me. And, indeed, my Soekris net5501 firewall no
longer boots, with this oops (full log and .config below):

[1.589543] cs5535-clockevt: Registering MFGPT timer as a clock event, using 
IRQ 7
[1.604921] BUG: unable to handle kernel NULL pointer dereference at   (null)
[1.604929] IP: [<  (null)>]   (null)
[1.604936] *pde = 
[1.604945] Oops:  [#1]
[1.604960] CPU: 0 PID: 1 Comm: swapper Not tainted 4.4.1+ #1
[1.604969] task: df44c000 ti: df436000 task.ti: df436000
[1.604978] EIP: 0060:[<>] EFLAGS: 00010093 CPU: 0
[1.604985] EIP is at 0x0
[1.604993] EAX: c048d760 EBX: df406540 ECX:  EDX: 620c
[1.605002] ESI: 0007 EDI: c048d720 EBP: df409fbc ESP: df409fb8
[1.605011]  DS: 007b ES: 007b FS:  GS:  SS: 0068
[1.605019] CR0: 8005003b CR2:  CR3: 004d5000 CR4: 0090
[1.605022] Stack:
[1.605044]  c02cf141 df409fdc c013f40d  0001  df406540 
df406550
[1.605066]  c0140f94 df409fe8 c013f4d3 df406540 df409ff8 c0141005 df40bf70 
df406540
[1.605074]  df40bf78 c0102c9b
[1.605077] Call Trace:
[1.605101]  [] ? mfgpt_tick+0x6e/0x77
[1.605127]  [] handle_irq_event_percpu+0x26/0xcf
[1.605141]  [] ? handle_simple_irq+0x46/0x46
[1.605157]  [] handle_irq_event+0x1d/0x29
[1.605170]  [] handle_level_irq+0x71/0x95
[1.605187]  [] handle_irq+0x44/0x56
[1.605204]  
[1.605206]  [] do_IRQ+0x32/0x99
[1.605228]  [] common_interrupt+0x29/0x30
[1.605246]  [] ? __setup_irq+0x1ff/0x3b6
[1.605263]  [] ? __do_softirq+0x53/0x154
[1.605278]  [] ? _local_bh_enable+0x3a/0x3a
[1.605292]  [] do_softirq_own_stack+0x1c/0x22
[1.605309]  
[1.605311]  [] irq_exit+0x31/0x67
[1.605323]  [] do_IRQ+0x86/0x99
[1.605339]  [] common_interrupt+0x29/0x30
[1.605362]  [] ? try_to_grab_pending+0x38/0xec
[1.605378]  [] ? console_unlock+0x211/0x39b
[1.605394]  [] vprintk_emit+0x2b4/0x2be
[1.605410]  [] ? init_hrt_clocksource+0xa4/0xa4
[1.605426]  [] vprintk_default+0x12/0x14
[1.605446]  [] printk+0x11/0x13
[1.605462]  [] cs5535_mfgpt_init+0xce/0xf1
[1.605480]  [] do_one_initcall+0xce/0x13e
[1.605496]  [] ? parse_args+0x1bd/0x281
[1.605514]  [] ? kernel_init_freeable+0xb9/0x156
[1.605531]  [] kernel_init_freeable+0xd9/0x156
[1.60]  [] kernel_init+0x8/0xb5
[1.605571]  [] ret_from_kernel_thread+0x20/0x34
[1.605585]  [] ? rest_init+0x59/0x59
[1.605598] Code:  Bad EIP value.
[1.605610] EIP: [<>] 0x0 SS:ESP 0068:df409fb8
[1.605615] CR2: 
[1.605647] ---[ end trace 24cdf30ee51cea99 ]---
[1.605654] Kernel panic - not syncing: Fatal exception in interrupt
[1.605659] Kernel Offset: disabled
[1.614882] Rebooting in 5 seconds..

As far as I can tell, it's not actually crashing *in*
handle_irq_event_percpu(): at least, piling printk()s in there shows
nothing, and piling printk()s in mfgpt_tick() shows successful
completion: that's clearly residue from a previous frame. I'm not sure
*where* it's crashing at this point. In particular, cs5535_mfgpt_init()
only calls platform_driver_register(): it never calls printk(), but both
are shown as non-?. So I am somewhat at a loss with this one so far: the
stacktrace looks more or less impossible, even quite a long way up. I
guess I'll be reduced to bisection :(

I'm using a GCC 4.9.4 prerelease (dated 20151031) which has built
multiple (4.1) kernels perfectly happily before now and has built 4.4.1
on two other machines without visible problem.

Full boot log:

POST: 012345689bcefghips1234ajklnopqr,,,tvwxy








comBIOS ver. 1.33  20070103  Copyright (C) 2000-2007 Soekris Engineering.

net5501

 CPU Geode LX 500 Mhz  Mbyte 
Memory0512

Pri Mas  SanDisk SDCFH2-002G LBA Xlt 992-64-63  2001 Mbyte

Slot   Vend Dev  ClassRev Cmd  Stat CL LT HT  Base1Base2   Int 
---
0:01:2 1022 2082 1010 0006 0220 08 00 00 A000  10
0:06:0 1106 3053 0296 0117 0210 08 40 00 E101 A0004000 11
0:07:0 1106 3053 0296 0117 0210 08 40 00 E201 A0004100 05
0:08:0 1106 3053 0296 0117 0210 08 40 00 E301 A0004200 09
0:09:0 1106 3053 0296 0117 0210 08 40 00 E401 A0004300 12
0:14:0 104C AC23 06040002 0107 0210 08 40 01   
0:20:0 1022 2090 06010003 0009 02A0 08 40 80 6001 6101 
0:20:2 1022 209A 01018001 0005 02A0 08 00 00   
0:21:0 1022 2094 0C031002 0006 0230 08 00 80 A0005000  15
0:21:1 1022 2095 0C032002 0006 0230

Re: 4.4.1 regression from 4.1.x: Soekris net5501 crash in IRQ after mfgpt timer initialization

2016-02-02 Thread Nix

On 2 Feb 2016, Thomas Gleixner verbalised:

> On Tue, 2 Feb 2016, Nix wrote:
>> The fairly trivial code motion below also seems to work, and may be more
>> like an actual fix, though I'm a bit horrified that it's this simple. I
>> may well have moved too much and unknowingly violated some invariant.
>
> I was lazy and did not do this, because it wreckages the error pathes. So I
> went for the workaround in the hope that the authors of that stuff will take
> care :)

Oh true.

As far as I can tell, getting this right requires a function that
reverses the effect of clockevents_config_and_register(), which does not
appear to exist yet :( everything else appears more or less
reversible...
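
To make the problem concrete, here is a minimal Python sketch of the unwind-on-failure pattern an init function needs. It is an illustration only, not kernel code: `run_init`, the step names and the two orderings are hypothetical, and the only thing taken from the discussion above is that the clockevents registration step has no reverse operation.

```python
from typing import Callable, List, Optional, Tuple

# (name, do, undo); undo is None when no reverse operation exists.
Step = Tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_init(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse.

    Returns the names of completed steps that could not be undone after
    a failure. An empty list means nothing was left behind.
    """
    done: List[Step] = []
    try:
        for step in steps:
            step[1]()
            done.append(step)
    except RuntimeError:
        leaked = []
        for name, _do, undo in reversed(done):
            if undo is None:
                leaked.append(name)  # nothing can reverse this step
            else:
                undo()
        return leaked
    return []


def ok() -> None:
    pass


def fail() -> None:
    raise RuntimeError("simulated failure")


# Hypothetical ordering 1: irreversible registration, then a step that fails.
irreversible_first = [("register_clockevent", ok, None), ("setup_irq", fail, ok)]
# Hypothetical ordering 2: the fallible step runs before the irreversible one.
irreversible_last = [("setup_irq", fail, ok), ("register_clockevent", ok, None)]

print(run_init(irreversible_first))  # ['register_clockevent']
print(run_init(irreversible_last))   # []
```

As long as a step with no undo runs before anything that can still fail, the error path has no way to clean up after it.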

-- 
NULL && (void)

Re: [PATCH] PCI: Expand quirk's handling of CS553x devices

2015-02-03 Thread Nix

On 3 Feb 2015, Myron Stowe told this:

> There seem to be a number of issues with CS553x devices and due to a
> recent patch series that detects PCI read-only BARs [1], we've encountered
> more.
>
> It appears that not only are the BAR values associated with this device
> often greater than the largest range that an IO decoder can request, they
> can also be non-conformant with respect to PCI's BAR sizing aspects,
> behaving instead, in a read-only manner [2].
>
> This patch addresses read-only BAR values corresponding to CS553x devices
> by expanding the existing quirk, manually inserting regions based on the
> device's BIOS settings (as opposed to basing such on normal BAR sizing
> actions) when necessary.

Looks good!

[0.270107] PCI: Probing PCI hardware
[0.280187] PCI host bridge to bus :00
[0.290028] pci_bus :00: root bus resource [io  0x-0x]
[0.300021] pci_bus :00: root bus resource [mem 0x-0x]
[0.310018] pci_bus :00: No busn resource found for root bus, will use [bus 00-ff]
[0.325514] pci :00:14.0: [Firmware Bug]: CS5536 ISA bridge quirk: reg 0x10: [io  0x6000-0x6007]
[0.330042] pci :00:14.0: [Firmware Bug]: CS5536 ISA bridge quirk: reg 0x14: [io  0x6100-0x61ff]
[0.340039] pci :00:14.0: [Firmware Bug]: CS5536 ISA bridge quirk: reg 0x18: [io  0x6200-0x63ff]
[0.350017] pci :00:14.0: CS5536 ISA bridge bug detected (incorrect header); workaround applied
[0.361456] pci :00:14.2: legacy IDE quirk: reg 0x10: [io  0x01f0-0x01f7]
[0.370019] pci :00:14.2: legacy IDE quirk: reg 0x14: [io  0x03f6]
[0.380019] pci :00:14.2: legacy IDE quirk: reg 0x18: [io  0x0170-0x0177]
[0.390017] pci :00:14.2: legacy IDE quirk: reg 0x1c: [io  0x0376]
[0.405842] pci :00:0e.0: PCI bridge to [bus 01]
[0.412043] Switched to clocksource pit
[...]
[0.780013] cs5535-gpio cs5535-gpio: reserved resource region [io  0x6100-0x61ff]
[0.785102] cs5535-mfgpt cs5535-mfgpt: reserved resource region [io  0x6200-0x63ff]
[0.801002] cs5535-mfgpt cs5535-mfgpt: 8 MFGPT timers available
[0.806684] cs5535-mfd :00:14.0: 5 devices registered.
[...]
[1.451754] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0' registered
[1.452515] pc87360: Device 0x09 not activated
[1.470755] cs5535-mfgpt cs5535-mfgpt: registered timer 0
[1.473869] Geode LX AES :00:01.2: GEODE AES engine enabled.
[1.48] cs5535-mfgpt cs5535-mfgpt: registered timer 1
[1.492402] cs5535-clockevt: Registering MFGPT timer as a clock event, using IRQ 7
[...]
[1.621402] Switched to clocksource tsc

nix@fold 3 /home/nix% grep cs5535 /proc/timer_list
Clock Event Device: cs5535-clockevt
nix@fold 4 /home/nix% ls -l /dev/watchdog
crw------- 1 root root 10, 130 Feb  4 00:14 /dev/watchdog
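
For anyone who wants to script the same check, here is a minimal Python sketch of the grep above. It assumes only the "Clock Event Device: <name>" line format visible in that output; the function name and the sample text are made up for illustration.

```python
import re
from typing import List


def clock_event_devices(timer_list_text: str) -> List[str]:
    """Return the names from 'Clock Event Device: <name>' lines."""
    return re.findall(
        r"^Clock Event Device:\s*(\S+)", timer_list_text, re.MULTILINE
    )


# Hypothetical sample standing in for the contents of /proc/timer_list.
sample = "some other line\nClock Event Device: cs5535-clockevt\n"
print("cs5535-clockevt" in clock_event_devices(sample))
```

On a live system the input would be the contents of /proc/timer_list, which may need root to read.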

Thank you for such a prompt fix on hardware as obscure as this :)

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [3.18.3 BISECTED REGRESSION] scx200_acb / cs5535-smb / geodewdt / cs5535-clockevt torpedoed

2015-02-03 Thread Nix

On 3 Feb 2015, Myron Stowe uttered the following:
> As expressed above, I believe this device is non-conformant.  This could be
> validated by instrumenting the kernel's sizing code in '__pci_read_base()';
> specifically the initial 'pci_read_config_dword()'s of 'l' and 'sz'.
> 
> Whereas we have been able to ignore read-only BARs in past occurrances, this
> time they are needed.  As such, I think we can solve this issue by expanding
> the existing quirk for this device.  I'll work that up and post.  If you
> would, please apply and test what is posted and report back.  In the mean
> time, I would be interested in obtaining confirmation as to my belief that
> this device's BARs are read-only (i.e. the instrumentation mentioned).  If
> you have time and are willing I would appreaciate that.  If not, I'm
> confident that is what is occurring and you can just stick with applying and
> testing the expanded quirk patch I intend to post.

Here, and the device behind it, just in case it's useful:

[1.050352] pci :00:14.0: [1022:2090] type 00 class 0x060100
[1.060152] name: :00:14.0; l: 6001; sz: 6001
[1.070037] pci :00:14.0: reg 0x10: [io  0x6000-0x7fff]
[1.080114] name: :00:14.0; l: 6101; sz: 6101
[1.090037] pci :00:14.0: reg 0x14: [io  0x6100-0x61ff]
[1.100113] name: :00:14.0; l: 6201; sz: 6201
[1.110036] pci :00:14.0: reg 0x18: [io  0x6200-0x63ff]
[1.120113] name: :00:14.0; l: 0; sz: 0
[1.130130] name: :00:14.0; l: 0; sz: 0
[1.140129] name: :00:14.0; l: 0; sz: 0
[1.150129] name: :00:14.0; l: 0; sz: 0
[1.160075] pci :00:14.0: CS5536 ISA bridge bug detected (incorrect header); workaround applied
[1.180018] pci :00:14.2: [1022:209a] type 00 class 0x010180
[1.190155] name: :00:14.2; l: 0; sz: 0
[1.200133] name: :00:14.2; l: 0; sz: 0
[1.210133] name: :00:14.2; l: 0; sz: 0
[1.220133] name: :00:14.2; l: 0; sz: 0
[1.230150] name: :00:14.2; l: e001; sz: fff1
[1.240037] pci :00:14.2: reg 0x20: [io  0xe000-0xe00f]
[1.250115] name: :00:14.2; l: 0; sz: 0
[1.260133] name: :00:14.2; l: 0; sz: 0
[1.270091] pci :00:14.2: legacy IDE quirk: reg 0x10: [io  0x01f0-0x01f7]
[1.280021] pci :00:14.2: legacy IDE quirk: reg 0x14: [io  0x03f6]
[1.290019] pci :00:14.2: legacy IDE quirk: reg 0x18: [io  0x0170-0x0177]
[1.300018] pci :00:14.2: legacy IDE quirk: reg 0x1c: [io  0x0376]
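
As a cross-check on the numbers above, here is a small Python sketch of the usual I/O BAR sizing arithmetic, applied to the 'l' and 'sz' values as printed in this log. It is not the kernel's __pci_read_base(); the function names are hypothetical, and treating "l == sz" as the sign of a read-only BAR is the assumption being tested in this thread.

```python
from typing import Optional, Tuple

IO_BAR_MASK = 0xFFFFFFFC  # bits 0-1 of an I/O BAR are flags, not address bits


def io_bar_size(sz: int) -> int:
    """Size implied by the value read back after writing all ones."""
    masked = sz & IO_BAR_MASK
    if masked == 0:
        return 0  # BAR not implemented
    return masked & -masked  # lowest set address bit is the size


def io_bar_range(l: int, sz: int) -> Optional[Tuple[int, int]]:
    """Inclusive (start, end) of the decoded region, or None if unimplemented."""
    size = io_bar_size(sz)
    if size == 0:
        return None
    start = l & IO_BAR_MASK
    return (start, start + size - 1)


def looks_read_only(l: int, sz: int) -> bool:
    """A BAR that reads back its original value ignored the all-ones write."""
    return l != 0 and l == sz


# (l, sz) pairs as printed in the log above.
for l, sz in [(0x6001, 0x6001), (0x6101, 0x6101), (0x6201, 0x6201), (0xE001, 0xFFF1)]:
    start, end = io_bar_range(l, sz)
    print(f"l={l:#x} sz={sz:#x} -> [io {start:#06x}-{end:#06x}] "
          f"read_only={looks_read_only(l, sz)}")
```

The three 00:14.0 BARs read back unchanged, so the "size" falls out of whatever low bits happen to be set in the address, which matches the 0x6000-0x7fff, 0x6100-0x61ff and 0x6200-0x63ff ranges reported. Only the 00:14.2 BAR at reg 0x20 sizes normally, to 16 bytes.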

-- 
NULL && (void)

Re: [Patch] sunrpc: NULL utsname dereference on NFS umount during namespace cleanup

2015-02-02 Thread Nix

On 2 Feb 2015, Trond Myklebust verbalised:
> Hmm... I'm at a loss to see how rpcb_create can ever call
> rpc_new_client() with a null value for the nodename with that patch
> applied. Are you 100% sure that the above Oops came from a patched
> kernel? That IP address of "rpc_new_client+0x13b/0x1f2" looks
> identical to the one in your original posting.

I've been swapping kernels a lot of late due to bisection -- it is
perfectly possible that I somehow ended up running an unpatched one :/

I'll do a build from scratch with the patch and reboot into it. My
apologies if this was a false positive (which it looks quite like it
might have been: your evidence is pretty persuasive!)

-- 
NULL && (void)

Re: [Patch] sunrpc: NULL utsname dereference on NFS umount during namespace cleanup

2015-02-02 Thread Nix

On 31 Jan 2015, n...@esperi.org.uk told this:
> I'll let it run overnight and give it a reboot in the morning.

Alas, my latest reboot hit:

[  215.245158] BUG: unable to handle kernel NULL pointer dereference at 0004
[  215.251602] IP: [<c034fb8c>] rpc_new_client+0x13b/0x1f2
[  215.251602] *pde = 
[  215.251602] Oops:  [#1]
[  215.251602] CPU: 0 PID: 398 Comm: bash Not tainted 3.18.5+ #1
[  215.251602] task: de1fcfc0 ti: de1fa000 task.ti: de1fa000
[  215.251602] EIP: 0060:[<c034fb8c>] EFLAGS: 00010246 CPU: 0
[  215.251602] EIP is at rpc_new_client+0x13b/0x1f2
[  215.251602] EAX:  EBX: df70f000 ECX: bae0 EDX: 0005
[  215.251602] ESI:  EDI: df6cc000 EBP: 0007 ESP: de1fbcac
[  215.251602]  DS: 007b ES: 007b FS:  GS:  SS: 0068
[  215.251602] CR0: 8005003b CR2: 0004 CR3: 1f645000 CR4: 0090
[  215.251602] Stack:
[  215.251602]  00d0 df6cc000 de1fbd4c  de1fbd4c de1fbd4c de1fbd10 de1fbf8c
[  215.251602]  c0350196 de1fbd4c de2056d0 de1fbd10 c0350300 c0262e32 c0263841 df528000
[  215.251602]  a2e0b542 0006 c044b178  de1fbddc 0010 de2056d0
[  215.251602] Call Trace:
[  215.251602]  [<c0350196>] ? rpc_create_xprt+0xc/0x74
[  215.251602]  [<c0350300>] ? rpc_create+0x102/0x10f
[  215.251602]  [<c0262e32>] ? ata_sff_check_status+0x8/0x9
[  215.251602]  [<c0263841>] ? ata_dev_select.constprop.20+0x83/0x95
[  215.251602]  [<c019ec6a>] ? __block_commit_write.isra.25+0x56/0x7f
[  215.251602]  [<c0359f04>] ? rpcb_create+0x6e/0x7c
[  215.251602]  [<c035a677>] ? rpcb_getport_async+0x124/0x25a
[  215.251602]  [<c0133bdf>] ? update_curr+0x81/0xb3
[  215.251602]  [<c0133e97>] ? check_preempt_wakeup+0xf0/0x134
[  215.251602]  [<c013110f>] ? check_preempt_curr+0x21/0x59
[  215.251602]  [<c0355294>] ? rpcauth_lookupcred+0x3f/0x47
[  215.251602]  [<c017ebb6>] ? __kmalloc+0xa3/0xc4
[  215.251602]  [<c0354c09>] ? rpc_malloc+0x39/0x48
[  215.251602]  [<c034ef05>] ? call_bind+0x2d/0x2e
[  215.251602]  [<c0354a71>] ? __rpc_execute+0x5c/0x187
[  215.251602]  [<c03500be>] ? rpc_run_task+0x55/0x5a
[  215.251602]  [<c035012c>] ? rpc_call_sync+0x69/0x81
[  215.251602]  [<c01e219e>] ? nsm_mon_unmon+0x8c/0xa0
[  215.251602]  [<c01e23c5>] ? nsm_unmonitor+0x5f/0xd3
[  215.251602]  [<c016b3cc>] ? bdi_unregister+0xf2/0x100
[  215.251602]  [<c01df4cc>] ? nlm_destroy_host_locked+0x4f/0x7c
[  215.251602]  [<c01df703>] ? nlmclnt_release_host+0xd8/0xe5
[  215.251602]  [<c015>] ? nlmclnt_done+0xc/0x14
[  215.251602]  [<c01ce621>] ? nfs_free_server+0x16/0x72
[  215.251602]  [<c01837a4>] ? deactivate_locked_super+0x26/0x37
[  215.251602]  [<c019493e>] ? cleanup_mnt+0x40/0x59
[  215.251602]  [<c012e185>] ? task_work_run+0x4f/0x5f
[  215.251602]  [<c0121373>] ? do_exit+0x264/0x670
[  215.251602]  [<c0127a2b>] ? __set_current_blocked+0xd/0xf
[  215.251602]  [<c0127b26>] ? sigprocmask+0x77/0x87
[  215.251602]  [<c012e097>] ? __task_pid_nr_ns+0x3a/0x41
[  215.251602]  [<c012e019>] ? find_vpid+0xd/0x17
[  215.251602]  [<c01217cc>] ? do_group_exit+0x2b/0x5d
[  215.251602]  [<c012180d>] ? SyS_exit_group+0xf/0xf
[  215.251602]  [<c03679b6>] ? syscall_call+0x7/0x7
[  215.251602] Code: 89 43 40 8b 44 24 04 89 43 18 8d 43 78 8b 53 40 89 43 3c 8b 12 e8 32 ac 00 00 c7 03 01 00 00 00 a1 dc f0 42 c0 8b 80 00 03 00 00 <8b> 70 04 83 c6 45 89 f0 e8 ab d1 eb ff 83 f8 20 7f 05 89 43 44
[  215.251602] EIP: [<c034fb8c>] rpc_new_client+0x13b/0x1f2 SS:ESP 0068:de1fbcac
[  215.251602] CR2: 0004
[  215.251602] ---[ end trace 4b9e971f6b3f2dc8 ]---
[  215.251602] Kernel panic - not syncing: Fatal exception
[  215.251602] Kernel Offset: 0x0 from 0xc010 (relocation range: 0xc000-0xe07f)
[  215.251602] Rebooting in 5 seconds..

So clearly the bug is still there. The connection I thought might exist
with the "xprt_adjust_timeout: rq_timeout = 0!" message was also
illusory: that message hasn't recurred since before the last reboot but
one, but the crash happened anyway.

-- 
NULL && (void)

Re: [3.18.3 BISECTED REGRESSION] scx200_acb / cs5535-smb / geodewdt / cs5535-clockevt torpedoed

2015-02-02 Thread Nix

On 2 Feb 2015, Myron Stowe verbalised:

> Nix:
>
> Thanks for the work you've already done with the bisection.  Let's see
> if we can get to the bottom of this.  Would you capture two couple sets
> of logs, one without the issue and another set with the commit at issue
> included for comparison.

> For each set please capture:
> a 'dmesg' log with the kernel boot parameters 'debug' and
> 'ignore_loglevel' added (the entire log from booting),

Good boot (commit reverted):

[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 3.18.5+ (compiler@fold) (gcc version 4.8.4 20140605 (prerelease) (GCC) ) #1 Fri Jan 30 21:05:54 GMT 2015
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x1fff] usable
[0.00] BIOS-e820: [mem 0xfff0-0x] reserved
[0.00] debug: ignoring loglevel setting.
[0.00] Notice: NX (Execute Disable) protection missing in CPU!
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x2 max_arch_pfn = 0x10
[0.00] initial memory mapped: [mem 0x-0x007f]
[0.00] Base memory trampoline at [c009b000] 9b000 size 16384
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00]  [mem 0x-0x000f] page 4k
[0.00] init_memory_mapping: [mem 0x1fc0-0x1fff]
[0.00]  [mem 0x1fc0-0x1fff] page 2M
[0.00] init_memory_mapping: [mem 0x1800-0x1fbf]
[0.00]  [mem 0x1800-0x1fbf] page 2M
[0.00] init_memory_mapping: [mem 0x0010-0x17ff]
[0.00]  [mem 0x0010-0x003f] page 4k
[0.00]  [mem 0x0040-0x17ff] page 2M
[0.00] 512MB LOWMEM available.
[0.00]   mapped low ram: 0 - 2000
[0.00]   low ram: 0 - 2000
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   Normal   [mem 0x0100-0x1fff]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x1000-0x0009efff]
[0.00]   node   0: [mem 0x0010-0x1fff]
[0.00] Initmem setup node 0 [mem 0x1000-0x1fff]
[0.00] On node 0 totalpages: 130974
[0.00] free_area_init_node: node 0, pgdat c0450eac, node_mem_map dfc00020
[0.00]   DMA zone: 32 pages used for memmap
[0.00]   DMA zone: 0 pages reserved
[0.00]   DMA zone: 3998 pages, LIFO batch:0
[0.00]   Normal zone: 992 pages used for memmap
[0.00]   Normal zone: 126976 pages, LIFO batch:31
[0.00] Using APIC driver default
[0.00] No local APIC present or hardware disabled
[0.00] APIC: disable apic facility
[0.00] APIC: switched to apic NOOP
[0.00] e820: [mem 0x2000-0xffef] available for PCI devices
[0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[0.00] pcpu-alloc: [0] 0
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 129950
[0.00] Kernel command line: BOOT_IMAGE=Linux console=ttyS0,19200 root=/dev/sda1 netconsole=24183@192.168.14.1/eth0,24183@192.168.14.15/00:e0:81:c0:91:1b ignore_loglevel debug
[0.00] PID hash table entries: 2048 (order: 1, 8192 bytes)
[0.00] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
[0.00] Initializing CPU#0
[0.00] Memory: 515520K/523896K available (2466K kernel code, 239K rwdata, 720K rodata, 208K init, 152K bss, 8376K reserved)
[0.00] virtual kernel memory layout:
[0.00] fixmap  : 0xfffa4000 - 0xf000   ( 364 kB)
[0.00] vmalloc : 0xe080 - 0xfffa2000   ( 503 MB)
[0.00] lowmem  : 0xc000 - 0xe000   ( 512 MB)
[0.00]   .init : 0xc045b000 - 0xc048f000   ( 208 kB)
[0.00]   .data : 0xc0368dd5 - 0xc0459cc0   ( 963 kB)
[0.00]   .text : 0xc010 - 0xc0368dd5   (2467 kB)
[0.00] Checking if this processor honours the WP bit even in supervisor mode...Ok.
[0.00] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[0.00] NR_IRQS:2304 nr_irqs:24 0
[0.00] CPU 0 irqstacks, hard=df408000 soft=df40a000
[0.00] console [ttyS0] enabled
[0.00] tsc: Fast TSC calibration using PIT
[0.00] tsc: Detected 499.914 MHz processor
[0.030019] Calibrating delay loop (skipped), value calculated using timer frequency.. 999.82 BogoMIPS (lpj=4999140)

Re: [Patch] sunrpc: NULL utsname dereference on NFS umount during namespace cleanup

2015-01-30 Thread Nix

On 30 Jan 2015, Trond Myklebust uttered the following:

> On Sun, 2015-01-25 at 16:55 -0500, Trond Myklebust wrote:
>> On Sun, Jan 25, 2015 at 4:06 PM, Bruno Prémont
>>  wrote:
>> > On a system running home-brown container (mntns, utsns, pidns, netns)
>> > with NFS mount-point bind-mounted into the container I hit the following
>> > trace if nfs filesystem is first umount()ed in init ns and then later
>> > umounted from container when the container exists.

I'm not using containers, but I *am* extensively bind-mounting
NFS filesystems, which probably has the same net effect.

> I was rather hoping that Bruno would fix up his patch and resend, but
> since other reports of the same bug are now surfacing... Please could
> you all check if something like the following patch fixes it.

I have to wait for another of those xprt != 0 warnings, I think. I've
had a couple of clean reboots, but I had the occasional clean reboot
even before this patch.

I'll let it run overnight and give it a reboot in the morning.

-- 
NULL && (void)

3.18.x: the return of the NFSv3 client BUG on reboot

2015-01-30 Thread Nix

I'm seeing these BUGs, fairly reliably, on multiple NFSv3 clients (x86
and x86-64) as of 3.18. As with the bug a few months ago, it seems to be
a side effect of trying to create a new client while shutting everything
down:

[  567.092093] BUG: unable to handle kernel NULL pointer dereference at 0004
[  567.100044] IP: [] rpc_new_client+0x13b/0x1f2
[  567.100044] *pde = 
[  567.100044] Oops:  [#1]
[  567.100044] CPU: 0 PID: 521 Comm: su Not tainted 3.18.5+ #1
[  567.100044] task: de28cfc0 ti: de2ba000 task.ti: de2ba000
[  567.100044] EIP: 0060:[<c034fbc0>] EFLAGS: 00010246 CPU: 0
[  567.100044] EIP is at rpc_new_client+0x13b/0x1f2
[  567.100044] EAX:  EBX: c000c800 ECX: bae0 EDX: 0005
[  567.100044] ESI:  EDI: de2ea800 EBP: 0007 ESP: de2bbcac
[  567.100044]  DS: 007b ES: 007b FS:  GS:  SS: 0068
[  567.100044] CR0: 8005003b CR2: 0004 CR3: 1f63b000 CR4: 0090
[  567.100044] Stack:
[  567.100044]  00d0 de2ea800 de2bbd4c  de2bbd4c de2bbd4c de2bbd10 
de2bbf8c
[  567.100044]  c03501ca de2bbd4c df57a940 de2bbd10 c0350334 0010 c0451150 
000280da
[  567.100044]  0002 0006 c044b178  de2bbddc 0010 df57a940 

[  567.100044] Call Trace:
[  567.100044]  [<c03501ca>] ? rpc_create_xprt+0xc/0x74
[  567.100044]  [<c0350334>] ? rpc_create+0x102/0x10f
[  567.100044]  [<c019ec6a>] ? __block_commit_write.isra.25+0x56/0x7f
[  567.100044]  [<c0359f38>] ? rpcb_create+0x6e/0x7c
[  567.100044]  [<c035a6ab>] ? rpcb_getport_async+0x124/0x25a
[  567.100044]  [<c0133bdf>] ? update_curr+0x81/0xb3
[  567.100044]  [<c0133e97>] ? check_preempt_wakeup+0xf0/0x134
[  567.100044]  [<c013110f>] ? check_preempt_curr+0x21/0x59
[  567.100044]  [<c03552c8>] ? rpcauth_lookupcred+0x3f/0x47
[  567.100044]  [<c017ebb6>] ? __kmalloc+0xa3/0xc4
[  567.100044]  [<c0354c3d>] ? rpc_malloc+0x39/0x48
[  567.100044]  [<c034ef39>] ? call_bind+0x2d/0x2e
[  567.100044]  [<c0354aa5>] ? __rpc_execute+0x5c/0x187
[  567.100044]  [<c03500f2>] ? rpc_run_task+0x55/0x5a
[  567.100044]  [<c0350160>] ? rpc_call_sync+0x69/0x81
[  567.100044]  [<c01e219e>] ? nsm_mon_unmon+0x8c/0xa0
[  567.100044]  [<c01e23c5>] ? nsm_unmonitor+0x5f/0xd3
[  567.100044]  [<c016b3cc>] ? bdi_unregister+0xf2/0x100
[  567.100044]  [<c01df4cc>] ? nlm_destroy_host_locked+0x4f/0x7c
[  567.100044]  [<c01df703>] ? nlmclnt_release_host+0xd8/0xe5
[  567.100044]  [<c015>] ? nlmclnt_done+0xc/0x14
[  567.100044]  [<c01ce621>] ? nfs_free_server+0x16/0x72
[  567.100044]  [<c01837a4>] ? deactivate_locked_super+0x26/0x37
[  567.100044]  [<c019493e>] ? cleanup_mnt+0x40/0x59
[  567.100044]  [<c012e185>] ? task_work_run+0x4f/0x5f
[  567.100044]  [<c0121373>] ? do_exit+0x264/0x670
[  567.100044]  [<c0182848>] ? vfs_write+0x98/0x151
[  567.100044]  [<c01829c4>] ? SyS_write+0x41/0x82
[  567.100044]  [<c01217cc>] ? do_group_exit+0x2b/0x5d
[  567.100044]  [<c012180d>] ? SyS_exit_group+0xf/0xf
[  567.100044]  [<c03679f6>] ? syscall_call+0x7/0x7
[  567.100044] Code: 89 43 40 8b 44 24 04 89 43 18 8d 43 78 8b 53 40 89 43 3c 
8b 12 e8 32 ac 00 00 c7 03 01 00 00 00 a1 dc f0 42 c0 8b 80 00 03 00 00 <8b> 70 
04 83 c6 45 89 f0 e8 77 d1 eb ff 83 f8 20 7f 05 89 43 44
[  567.100044] EIP: [<c034fbc0>] rpc_new_client+0x13b/0x1f2 SS:ESP 0068:de2bbcac
[  567.100044] CR2: 0004
[  567.100044] ---[ end trace 9d49b8a60cfa6a52 ]---
[  567.100044] Kernel panic - not syncing: Fatal exception
[  567.100044] Kernel Offset: 0x0 from 0xc010 (relocation range: 
0xc000-0xe07f)
[  567.100044] Rebooting in 5 seconds..

Notably, these only seem to happen after at least one instance of the
dreaded, frequently-occurring but normally harmless

[  101.019879] xprt_adjust_timeout: rq_timeout = 0!
[  101.047744] lockd: server spindle.srvr.nix not responding, still trying

(i.e. if you manage to reboot before one of these happens -- as happened
recently when I was bisecting another problem -- you don't seem to get a
panic).
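
The ordering claim above (a rq_timeout warning always comes before the panic) is easy to check mechanically against a captured console log. A minimal sketch, assuming the log is available as a list of text lines; the function name is hypothetical and only the two message strings come from the logs in this message:

```python
WARNING = "xprt_adjust_timeout: rq_timeout = 0!"
OOPS = "BUG: unable to handle kernel NULL pointer dereference"


def warning_precedes_oops(log_lines):
    """Report whether a rq_timeout warning was logged before the first oops.

    Returns True or False once an oops line is found, and None when the
    log contains no oops at all.
    """
    warned = False
    for line in log_lines:
        if WARNING in line:
            warned = True
        elif OOPS in line:
            return warned
    return None
```

Running this over the logs of several crashing boots would show whether the correlation holds every time or only most of the time.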

NFS-related .config options:

CONFIG_NFS_FS=y
# CONFIG_NFS_V2 is not set
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_SWAP is not set
# CONFIG_NFS_V4_1 is not set
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
# CONFIG_NFSD_V4 is not set
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y

If there's any more info I can provide, just ask. I can reproduce this
one quite fast and I have a box I can test on easily (it's my firewall
so I'm knocked off the net every time I do it, but it's not doing many
other jobs and I can reboot it without losing any state to speak of).

[3.18.3 BISECTED REGRESSION] scx200_acb / cs5535-smb / geodewdt / cs5535-clockevt torpedoed

2015-01-30 Thread Nix

As of 3.18.3, the watchdog timer, high-res timers, etc on my Soekris
net5501 no longer work.

dmesg says:

[    1.465681] scx200_acb: can't allocate io 0x0-0x7
[    1.473909] cs5535-smb: probe of cs5535-smb failed with error -5
[    1.479991] pc87360: Device 0x09 not activated
[    1.488237] geodewdt: No timers were available
[    1.505612] Geode LX AES 0000:00:01.2: GEODE AES engine enabled.
[    1.511699] cs5535-clockevt: Could not allocate MFGPT timer

rather than the expected

[    1.465987] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0' registered
[    1.466747] pc87360: Device 0x09 not activated
[    1.484989] cs5535-mfgpt cs5535-mfgpt: registered timer 0
[    1.488096] Geode LX AES 0000:00:01.2: GEODE AES engine enabled.
[    1.504232] cs5535-mfgpt cs5535-mfgpt: registered timer 1
[    1.506634] cs5535-clockevt: Registering MFGPT timer as a clock event, using IRQ 7

Bisected to:

commit efdb9b956aa06868a052f0d4387f5f34e2321e41
Author: Myron Stowe <myron.st...@redhat.com>
Date:   Thu Oct 30 11:54:37 2014 -0600

PCI: Restore detection of read-only BARs

commit 36e8164882ca6d3c41cb91e6f09a3ed236841f80 upstream.

before which everything works OK. (Bisection confirmed by reverting this
commit, following which everything works again.)

/proc/ioports in a good boot:

0000-001f : dma1
  0000-0000 : cs5535-acpi
    0000-0000 : cs5535-pms
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-0060 : keyboard
0064-0064 : keyboard
0070-0071 : rtc_cmos
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : 0000:00:14.2
01f0-01f7 : 0000:00:14.2
  01f0-01f7 : pata_cs5536
0376-0376 : 0000:00:14.2
03f6-03f6 : 0000:00:14.2
  03f6-03f6 : pata_cs5536
03f8-03ff : serial
0cf8-0cff : PCI conf1
6000-6007 : cs5535-smb
  6000-6007 : 0000:00:14.0
    6000-6007 : CS5535 ACB0
6100-61ff : cs5535-gpio
  6100-61ff : 0000:00:14.0
    6100-61ff : cs5535-gpio
6200-63ff : cs5535-mfgpt
  6200-63ff : 0000:00:14.0
    6200-63ff : cs5535-mfgpt
6620-662f : pc87360
  6620-662f : pc87360
6640-664f : pc87360
  6640-664f : pc87360
d000-dfff : PCI Bus 0000:01
  d000-d0ff : 0000:01:00.0
    d000-d0ff : via-rhine
  d100-d1ff : 0000:01:01.0
    d100-d1ff : via-rhine
  d200-d2ff : 0000:01:02.0
    d200-d2ff : via-rhine
  d300-d3ff : 0000:01:03.0
    d300-d3ff : via-rhine
e000-e00f : 0000:00:14.2
  e000-e00f : pata_cs5536
e100-e1ff : 0000:00:06.0
  e100-e1ff : via-rhine
e200-e2ff : 0000:00:07.0
  e200-e2ff : via-rhine
e300-e3ff : 0000:00:08.0
  e300-e3ff : via-rhine
e400-e4ff : 0000:00:09.0
  e400-e4ff : via-rhine

Bad boot:

0000-001f : dma1
  0000-0007 : cs5535-smb
    0000-0000 : cs5535-acpi
      0000-0000 : cs5535-pms
        0000-0000 : cs5535-mfgpt
          0000-0000 : cs5535-gpio
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-0060 : keyboard
0064-0064 : keyboard
0070-0071 : rtc_cmos
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : 0000:00:14.2
01f0-01f7 : 0000:00:14.2
  01f0-01f7 : pata_cs5536
0376-0376 : 0000:00:14.2
03f6-03f6 : 0000:00:14.2
  03f6-03f6 : pata_cs5536
03f8-03ff : serial
0cf8-0cff : PCI conf1
6620-662f : pc87360
  6620-662f : pc87360
6640-664f : pc87360
  6640-664f : pc87360
d000-dfff : PCI Bus 0000:01
  d000-d0ff : 0000:01:00.0
    d000-d0ff : via-rhine
  d100-d1ff : 0000:01:01.0
    d100-d1ff : via-rhine
  d200-d2ff : 0000:01:02.0
    d200-d2ff : via-rhine
  d300-d3ff : 0000:01:03.0
    d300-d3ff : via-rhine
e000-e00f : 0000:00:14.2
  e000-e00f : pata_cs5536
e100-e1ff : 0000:00:06.0
  e100-e1ff : via-rhine
e200-e2ff : 0000:00:07.0
  e200-e2ff : via-rhine
e300-e3ff : 0000:00:08.0
  e300-e3ff : via-rhine
e400-e4ff : 0000:00:09.0
  e400-e4ff : via-rhine

A diff shows what I can only describe as an ioports traffic jam:

--- /tmp/good-ioports   2015-01-30 21:01:42.724692790 +0000
+++ /tmp/bad-ioports    2015-01-30 21:01:51.803954107 +0000
@@ -1,6 +1,9 @@
 0000-001f : dma1
-  0000-0000 : cs5535-acpi
-    0000-0000 : cs5535-pms
+  0000-0007 : cs5535-smb
+    0000-0000 : cs5535-acpi
+      0000-0000 : cs5535-pms
+        0000-0000 : cs5535-mfgpt
+          0000-0000 : cs5535-gpio
 0020-0021 : pic1
 0040-0043 : timer0
 0050-0053 : timer1
@@ -19,15 +22,6 @@
   03f6-03f6 : pata_cs5536
 03f8-03ff : serial
 0cf8-0cff : PCI conf1
-6000-6007 : cs5535-smb
-  6000-6007 : 0000:00:14.0
-    6000-6007 : CS5535 ACB0
-6100-61ff : cs5535-gpio
-  6100-61ff : 0000:00:14.0
-    6100-61ff : cs5535-gpio
-6200-63ff : cs5535-mfgpt
-  6200-63ff : 0000:00:14.0
-    6200-63ff : cs5535-mfgpt
 6620-662f : pc87360
   6620-662f : pc87360
 6640-664f : pc87360
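
For anyone wanting to compare two /proc/ioports captures by owner rather than by line, here is a minimal sketch. It assumes each capture is available as one text string; the function names are hypothetical, and it ignores the nesting of the resource tree, looking only at which ranges each owner claims:

```python
def parse_ioports(text):
    """Map each owner name to the set of port ranges it claims."""
    owners = {}
    for line in text.splitlines():
        if " : " not in line:
            continue  # skip blank or malformed lines
        port_range, name = line.strip().split(" : ", 1)
        owners.setdefault(name, set()).add(port_range)
    return owners


def moved_regions(good_text, bad_text):
    """Return {owner: (good_ranges, bad_ranges)} for owners whose ranges differ."""
    good = parse_ioports(good_text)
    bad = parse_ioports(bad_text)
    return {
        name: (sorted(good.get(name, ())), sorted(bad.get(name, ())))
        for name in sorted(set(good) | set(bad))
        if good.get(name) != bad.get(name)
    }
```

On the two listings above this would be expected to single out the cs5535 regions and the 0000:00:14.0 device, which is the same story the diff tells.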

A diff between good and bad boot output is not very edifying: all we
really see is ordering differences, address differences and things
dropping out of the log because of the absent hardware (mostly, I think,
the different time source because we're forced to fall back to the PIT
rather than using the nice fast cs5535-clockevt):

--- /tmp/good-boot      2015-01-30 20:57:36.464752599 +0000
+++ /tmp/bad-boot       2015-01-30

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-11-06 Thread Nix

On 6 Nov 2014, Johan Hovold said:

> On Thu, Nov 06, 2014 at 01:49:13PM +0000, Nix wrote:
>> On 5 Nov 2014, Johan Hovold told this:
>> > On Wed, Nov 05, 2014 at 03:14:49PM +0000, Nix wrote:
>
>> > Could you try two more things (to make sure line control is really the
>> > culprit):
>> >
>> > 1. If you clear HUPCL in ekeyd so that the lines are never lowered, does
>> > that fix the stability issue?
>> 
>> Definitely not. I got a hang after the first reboot out of an afflicted
>> kernel, when using a HUPCLless ekeyd. Weird. (I guess they're lowered on
>> reboot anyway?)
>
> It's actually only the timings related to the control-lines being raised
> on open that have changed, so this would seem consistent with that.

Urgh. No wonder it was intermittent.

> Thanks for tracking this down. That bisect cannot have been fun given
> the low failure rate (sometimes only one in ten reboots?).

It often failed quite fast, but yes, the negative case was hard to
prove: I had to rewind twice. I counted reboots because I'm a flaming
aspie pedant. 173 reboots that took... thank goodness it replicated on
the machine of mine that's fastest to reboot!
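
To put a rough number on why the negative case is so expensive to prove: if each reboot fails independently with probability p, then n clean reboots in a row still miss the bug with probability (1 - p)^n. A minimal sketch, assuming independent reboots and a fixed per-boot failure rate (both are assumptions, the thread only mentions "one in ten" as a rough figure); the function name is hypothetical:

```python
import math


def clean_boots_needed(failure_rate, max_miss_probability=0.05):
    """Smallest n such that (1 - failure_rate) ** n <= max_miss_probability."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return math.ceil(
        math.log(max_miss_probability) / math.log(1.0 - failure_rate)
    )
```

With a one-in-ten failure rate, clean_boots_needed(0.1) gives 29, so every bisection step that comes out "good" costs on the order of thirty reboots before it can be trusted at the 95% level.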

-- 
NULL && (void)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-11-06 Thread Nix

On 5 Nov 2014, Johan Hovold told this:

> On Wed, Nov 05, 2014 at 03:14:49PM +0000, Nix wrote:
>> >> >  What if you
>> >> > physically disconnect and reconnect the device instead, or simply
>> >> 
>> >> That fixes it, in fact it's the only way to fix it once it's broken by
>> >> this bug.
>> >
>> > I didn't mean whether it would get the device working again, but rather
>> > whether you could kill it this way.
>> 
>> Never seen it fail after a physical disconnection.
>
> Ok.
>
> And it has to include an enumeration, since just opening and closing it
> (restarting the daemon) repeatedly seems to work?

Well, it's more accurate to say that restarting the daemon doesn't make
it fail, but won't make it start working if it's already gone nuts
either. It normally copes fine with the daemon stopping and starting,
yes (and the daemon copes fine with keys being connected and
disconnected).

>> > Yeah, it seems your device firmware has crashed. It stops responding to
>> > control requests.
>> 
>> Not surprising: I was fairly sure we were provoking a key firmware crash
>> or something like that. This is a device with no support for flow
>> control at all, after all, so I'm not terribly surprised that trying to
>> do flow control confuses it.
>
> Only weird thing is that it has been coping with those calls for a long
> time. Only the slightly changed timings seem to have triggered this
> issue.

Yeah. I get the strong impression from Daniel that the USB side of this
was done by taking something that worked on the kernel of the day,
adding the key-specific stuff to it, and making sure that it still
worked. i.e. this was not a from-spec reimplementation, it was using an
existing (old) stack. If that stack makes iffy assumptions, so will the
entropy key.

>> Look at it? Only Daniel Silverstone (Cc:ed) can do that. The only copy
>> of the firmware I have is baked into the sealed key. :)

As his email noted, no he can't :) but you do now have a link to the
thing it was based on.

> Ah, ok. I guess we need a new quirk then. There's already a quirk in the
> driver to suppress errors when trying to set the control lines.
>
> But that doesn't help this device, which happily accepts the requests
> and then dies at random times.

Yeah. Strange. (I thought it was 'right after', but you seem to
have disproved that. It's only 'sometime later'.)

> Could you try two more things (to make sure line control is really the
> culprit):
>
> 1. If you clear HUPCL in ekeyd so that the lines are never lowered, does
> that fix the stability issue?

Definitely not. I got a hang after the first reboot out of an afflicted
kernel, when using a HUPCLless ekeyd. Weird. (I guess they're lowered on
reboot anyway?)
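
For reference, clearing HUPCL on a tty looks roughly like this. This is a minimal sketch using Python's termios module, not the change actually made to ekeyd; the helper names are hypothetical:

```python
import termios

CFLAG = 2  # index of c_cflag in the list returned by termios.tcgetattr()


def without_hupcl(attrs):
    """Return a copy of a tcgetattr()-style attribute list with HUPCL cleared."""
    new_attrs = list(attrs)
    new_attrs[CFLAG] &= ~termios.HUPCL
    return new_attrs


def clear_hupcl(fd):
    """Clear HUPCL on an open tty so the last close does not drop the
    modem control lines."""
    termios.tcsetattr(fd, termios.TCSANOW, without_hupcl(termios.tcgetattr(fd)))
```

In use, clear_hupcl() would be called on a descriptor obtained from something like os.open("/dev/ttyACM0", os.O_RDWR | os.O_NOCTTY). As the quoted reply points out, this only affects the lowering of the lines on close, not the raising of them on open.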

> 2. Can you verify that the patch below works? Did I use the correct
> VID/PID? Could you provide "lsusb -v" output (for the capability flags)
> as well?

The VID/PID are right.
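
One way to double-check a VID/PID pair is to pull it out of the summary line that lsusb prints for the device. A minimal sketch, assuming the usual "Bus NNN Device NNN: ID vvvv:pppp ..." layout of that line; the function name is hypothetical:

```python
import re

ID_LINE = re.compile(
    r"^Bus \d+ Device \d+: ID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})"
)


def usb_id(lsusb_line):
    """Return (vendor_id, product_id) as ints from an lsusb summary line,
    or None if the line does not have the expected shape."""
    match = ID_LINE.match(lsusb_line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)
```

For the key in question the expected result is (0x20df, 0x0001).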

The patch seems to work. I tested against the usb testing tree directly,
since that was easier than adjusting it to apply against 3.16. Sixteen
reboots, no failures, looks to be fixed.

lsusb output:

Bus 002 Device 003: ID 20df:0001 Simtec Electronics Entropy Key [UDEKEY01]
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            2 Communications
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  idVendor           0x20df Simtec Electronics
  idProduct          0x0001 Entropy Key [UDEKEY01]
  bcdDevice            2.00
  iManufacturer           1 Simtec Electronics
  iProduct                2 Entropy Key
  iSerial                 3 M/9tBjBLNzE2RSFD
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           67
    bNumInterfaces          2
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xc0
      Self Powered
    MaxPower               76mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         2 Communications
      bInterfaceSubClass      2 Abstract (modem)
      bInterfaceProtocol      1 AT-commands (v.25ter)
      iInterface              0
      CDC Header:
        bcdCDC               1.10
      CDC Call Management:
        bmCapabilities       0x00
        bDataInterface          1
      CDC ACM:
        bmCapabilities       0x00
      CDC Union:
        bMasterInterface        0
        bSlaveInterface         1
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x82  EP 2 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0008  1x 8 bytes
        bInterval             255

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-11-05 Thread Nix

On 5 Nov 2014, Johan Hovold stated:

> On Fri, Oct 31, 2014 at 04:44:46PM +0000, Nix wrote:
>> Sorry for the delay: illness and work-related release time flurries.
>> 
>> On 24 Oct 2014, Johan Hovold told this:
>> 
>> > [ +CC: linux-usb ]
>> >
>> > On Wed, Oct 22, 2014 at 04:36:59PM +0100, Nix wrote:
>> >> On 22 Oct 2014, Johan Hovold outgrape:
>> >> 
>> >> > On Wed, Oct 22, 2014 at 10:31:17AM +0100, Nix wrote:
>> >> >> On 14 Oct 2014, Johan Hovold verbalised:
>> >> >> 
>> >> >> > On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> >> >> >> I have checked: this code is being executed against a symlink that
>> >> >> >> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> >> >> >> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> >> >> >> with this commit reverted...)
>> >> >> >
>> >> >> > You could verify that by enabling debugging in the cdc-acm driver and
>> >> >> > making sure that the corresponding control messages are indeed sent on
>> >> >> > close.
>> >> >> 
>> >> >> I have a debugging dump at
>> >> >> <http://www.esperi.org.uk/~nix/temporary/cdc-acm.log>; it's fairly
>> >> >
>> >> > What kernel were you using here? The log seems to suggest that it was
>> >> > generated with the commit in question reverted.
>> >> 
>> >> Look now :) (the previous log is still there in cdc-acm-reverted.log.)
>> >
>> > Unfortunately, it seems the logs are incomplete. There are lots of
>> > entries missing (e.g. "acm_tty_install" when opening, but also some
>> > "acm_submit_read_urb"), some of which were there in your first log.
>> 
>> OK. What we have now in
>> <http://www.esperi.org.uk/~nix/temporary/cdc-acm.log> is a log from the
>> pristine upstream 3.16.6 kernel with cdc-acm debugging turned on and the
>> acm_tty_write - count stuff in acm_tty_write() disabled: I've increased
>> the dmesg buffer size so the top isn't being cut off any more. It took a
>> lot of boots for it to fail this time: about a dozen. The log contains
>> the boot that failed and the one before.
>> 
>> (To my uneducated eye, the initial traffic to/from the key looks *very*
>> different in the second boot. Something is clearly wrong by this point,
>> but that's not much of a surprise!)
>
> The log appears incomplete again, this time it seems the last part is
> completely missing (the device is never closed for example). The device
> still seems to be responding.

Nope -- by the time I clipped the log, the device was definitely
nonresponsive.

I've appended the remaining log, just in case. This is the same as the
snapshot I added to the email itself last time: a close-and-open as I
tried restarting the daemon, and another close as part of system
shutdown.

>> >  What if you
>> > physically disconnect and reconnect the device instead, or simply
>> 
>> That fixes it, in fact it's the only way to fix it once it's broken by
>> this bug.
>
> I didn't mean whether it would get the device working again, but rather
> whether you could kill it this way.

Never seen it fail after a physical disconnection.

>> ... no, that doesn't help. Additional log from that shows a lot of what
>> looks like error returns:
>> 
>> Oct 31 16:38:03 fold kern debug: : [  168.135213] cdc_acm 2-1:1.0: acm_tty_close
>> Oct 31 16:38:08 fold kern debug: : [  173.130531] cdc_acm 2-1:1.0: acm_ctrl_msg - rq 0x22, val 0x0, len 0x0, result -110
>
> Yeah, it seems your device firmware has crashed. It stops responding to
> control requests.

Not surprising: I was fairly sure we were provoking a key firmware crash
or something like that. This is a device with no support for flow
control at all, after all, so I'm not terribly surprised that trying to
do flow control confuses it.

> The above is all normal, but
>
>> Oct 31 16:38:08 fold kern debug: : [  173.161489] cdc_acm 2-1:1.0: acm_ctrl_msg - rq 0x22, val 0x3, len 0x0, result -62
>
> here's another timeout. It's dead.

Again, not surprising.
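
For reference, the acm_ctrl_msg lines quoted in this thread can be decoded
mechanically. What follows is a rough sketch, not anything taken from the
driver: the function name is invented, and it assumes that request 0x22 is
the USB CDC SET_CONTROL_LINE_STATE request, that bits 0 and 1 of "val" are
DTR and RTS, and that negative results are Linux errno values (2 = ENOENT,
62 = ETIME, 110 = ETIMEDOUT).

```python
import re

SET_CONTROL_LINE_STATE = 0x22   # USB CDC request code (assumed meaning of rq 0x22)
ACM_CTRL_DTR = 0x01             # bit 0 of val
ACM_CTRL_RTS = 0x02             # bit 1 of val

# Linux errno numbers seen in this thread; hard-coded so the sketch
# does not depend on the platform it runs on.
LINUX_ERRNO = {2: "ENOENT", 62: "ETIME", 110: "ETIMEDOUT"}

CTRL_MSG = re.compile(
    r"acm_ctrl_msg\s*- rq (0x[0-9a-fA-F]+), val (0x[0-9a-fA-F]+), "
    r"len (0x[0-9a-fA-F]+), result (-?\d+)"
)


def decode_ctrl_msg(line):
    """Decode one acm_ctrl_msg debug line; return None if it does not match."""
    m = CTRL_MSG.search(line)
    if m is None:
        return None
    rq = int(m.group(1), 16)
    val = int(m.group(2), 16)
    result = int(m.group(4))
    asserted = [name for bit, name in ((ACM_CTRL_DTR, "DTR"), (ACM_CTRL_RTS, "RTS"))
                if val & bit]
    if result >= 0:
        status = "ok"
    else:
        status = LINUX_ERRNO.get(-result, "errno %d" % -result)
    return {
        "request": "SET_CONTROL_LINE_STATE" if rq == SET_CONTROL_LINE_STATE else hex(rq),
        "asserted": asserted,
        "status": status,
    }
```

Fed the two lines quoted above, this reports a request to drop both lines
that ended in ETIMEDOUT (-110), then a request to raise DTR and RTS that
ended in ETIME (-62): both are timeouts, consistent with a device that has
stopped answering control requests.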

> Did you get anywhere with trying to look at the device firmware?

Look at it? Only Daniel Silverstone (Cc:ed) can do that. The only copy
of the firmware I have is baked into the sealed key. :)

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-31 Thread Nix

Sorry for the delay: illness and work-related release time flurries.

On 24 Oct 2014, Johan Hovold told this:

> [ +CC: linux-usb ]
>
> On Wed, Oct 22, 2014 at 04:36:59PM +0100, Nix wrote:
>> On 22 Oct 2014, Johan Hovold outgrape:
>> 
>> > On Wed, Oct 22, 2014 at 10:31:17AM +0100, Nix wrote:
>> >> On 14 Oct 2014, Johan Hovold verbalised:
>> >> 
>> >> > On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> >> >> I have checked: this code is being executed against a symlink that
>> >> >> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> >> >> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> >> >> with this commit reverted...)
>> >> >
>> >> > You could verify that by enabling debugging in the cdc-acm driver and
>> >> > making sure that the corresponding control messages are indeed sent on
>> >> > close.
>> >> 
>> >> I have a debugging dump at
>> >> <http://www.esperi.org.uk/~nix/temporary/cdc-acm.log>; it's fairly
>> >
>> > What kernel were you using here? The log seems to suggest that it was
>> > generated with the commit in question reverted.
>> 
>> Look now :) (the previous log is still there in cdc-acm-reverted.log.)
>
> Unfortunately, it seems the logs are incomplete. There are lots of
> entries missing (e.g. "acm_tty_install" when opening, but also some
> "acm_submit_read_urb"), some of which were there in your first log.

OK. What we have now in
<http://www.esperi.org.uk/~nix/temporary/cdc-acm.log> is a log from the
pristine upstream 3.16.6 kernel with cdc-acm debugging turned on and the
acm_tty_write - count stuff in acm_tty_write() disabled: I've increased
the dmesg buffer size so the top isn't being cut off any more. It took a
lot of boots for it to fail this time: about a dozen. The log contains
the boot that failed and the one before.

(To my uneducated eye, the initial traffic to/from the key looks *very*
different in the second boot. Something is clearly wrong by this point,
but that's not much of a surprise!)

>> This contains two boots -- one on which the USB key worked, and the next
>> (with an identical kernel) with which the key was broken. (I'm not sure
>> whether this problem happens at startup or shutdown time, so it seemed
>> best to provide both.)
>
> That's a good idea.
>
> Is it only after reboot you've seen the device fail?

So far.

>  What if you
> physically disconnect and reconnect the device instead, or simply

That fixes it, in fact it's the only way to fix it once it's broken by
this bug.

> repeatedly open and close it?

Hm. Good idea.

... no, that doesn't help. Additional log from that shows a lot of what
looks like error returns:

Oct 31 16:38:03 fold kern debug: : [  168.135213] cdc_acm 2-1:1.0: acm_tty_close
Oct 31 16:38:08 fold kern debug: : [  173.130531] cdc_acm 2-1:1.0: acm_ctrl_msg - rq 0x22, val 0x0, len 0x0, result -110
Oct 31 16:38:08 fold kern debug: : [  173.130557] cdc_acm 2-1:1.0: acm_port_shutdown
Oct 31 16:38:08 fold kern debug: : [  173.131475] cdc_acm 2-1:1.0: acm_ctrl_irq - urb shutting down with status: -2
Oct 31 16:38:08 fold kern debug: : [  173.132474] cdc_acm 2-1:1.1: acm_write_bulk - len 0/1, status -2
Oct 31 16:38:08 fold kern debug: : [  173.132519] cdc_acm 2-1:1.1: acm_softint
Oct 31 16:38:08 fold kern debug: : [  173.133473] cdc_acm 2-1:1.1: acm_write_bulk - len 0/1, status -2
Oct 31 16:38:08 fold kern debug: : [  173.133510] cdc_acm 2-1:1.1: acm_softint
Oct 31 16:38:08 fold kern debug: : [  173.134471] cdc_acm 2-1:1.1: acm_write_bulk - len 0/1, status -2
Oct 31 16:38:08 fold kern debug: : [  173.134507] cdc_acm 2-1:1.1: acm_softint
Oct 31 16:38:08 fold kern debug: : [  173.135471] cdc_acm 2-1:1.1: acm_write_bulk - len 0/1, status -2
Oct 31 16:38:08 fold kern debug: : [  173.135509] cdc_acm 2-1:1.1: acm_softint
Oct 31 16:38:08 fold kern debug: : [  173.136472] cdc_acm 2-1:1.1: acm_write_bulk - len 0/1, status -2
Oct 31 16:38:08 fold kern debug: : [  173.136507] cdc_acm 2-1:1.1: acm_softint
Oct 31 16:38:08 fold kern debug: : [  173.137471] cdc_acm 2-1:1.1: acm_write_bulk - len 0/1, status -2
Oct 31 16:38:08 fold kern debug: : [  173.137517] cdc_acm 2-1:1.1: acm_softint
Oct 31 16:38:08 fold kern debug: : [  173.138471] cdc_acm 2-1:1.1: acm_write_bulk - len 0/1, status -2
Oct 31 16:38:08 fold kern debug: : [  173.138520] cdc_acm 2-1:1.1: acm_softint
Oct 31 16:38:08 fold kern debug: : [  173.139469] cdc_acm 2-1:1.1: acm_write_bulk - len 0/1, status -2
Oct 31 16:38:08 fold kern debug: : [  173.139515] cdc_acm 2-1:1.1: acm_softint
Oct 31 16:38:08 fold kern debug: : [  173.140470] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 0, len 0
Oct 31 16:38:08 fold kern debug: : [  173.140491] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 31 16:38:08 fold kern debug: : [  173.141469] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 1, len 0
Oct 31 16:38:08 fold kern debug: : [  173.141489] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 31 16:38:08 fold kern debug: : [  173.142469] cdc_acm 2-1:1.1

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-24 Thread Nix

On 24 Oct 2014, Johan Hovold told this:

> [ +CC: linux-usb ]
> On Wed, Oct 22, 2014 at 04:36:59PM +0100, Nix wrote:
>> On 22 Oct 2014, Johan Hovold outgrape:
>> 
>> > On Wed, Oct 22, 2014 at 10:31:17AM +0100, Nix wrote:
>> >> On 14 Oct 2014, Johan Hovold verbalised:
>> >> 
>> >> > On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> >> >> I have checked: this code is being executed against a symlink that
>> >> >> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> >> >> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> >> >> with this commit reverted...)
>> >> >
>> >> > You could verify that by enabling debugging in the cdc-acm driver and
>> >> > making sure that the corresponding control messages are indeed sent on
>> >> > close.
>> >> 
>> >> I have a debugging dump at
>> >> <http://www.esperi.org.uk/~nix/temporary/cdc-acm.log>; it's fairly
>> >
>> > What kernel were you using here? The log seems to suggest that it was
>> > generated with the commit in question reverted.
>> 
>> Look now :) (the previous log is still there in cdc-acm-reverted.log.)
>
> Unfortunately, it seems the logs are incomplete. There are lots of
> entries missing (e.g. "acm_tty_install" when opening, but also some
> "acm_submit_read_urb"), some of which were there in your first log.

Curious. It's a straight cut-and-paste from the syslog. Maybe the kmsg
buffer overflowed, but I start the ekey daemon *after* I start syslogd,
so that seems unlikely.

I'll have another look.
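
One way to check a pasted log for the missing entries is to count how
often each cdc-acm debug function appears. The sketch below is only an
assumption about how one might do that (the helper names are made up): it
rejoins records that the mailer wrapped onto a second line, using the
syslog timestamp at the start of each record as the boundary, and then
tallies the acm_* function names.

```python
import re
from collections import Counter

# A record starts with a syslog timestamp such as "Oct 31 16:38:03 ".
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")
# The function name is the first acm_* word after "cdc_acm <interface>:".
FUNC = re.compile(r"cdc_acm \S+:\s+(acm_\w+)")


def count_acm_calls(log_text):
    """Count cdc-acm debug messages per function, rejoining wrapped lines."""
    records = []
    for raw in log_text.splitlines():
        if not raw.strip():
            continue
        if records and not TIMESTAMP.match(raw):
            # Continuation of the previous, wrapped record.
            records[-1] += " " + raw.strip()
        else:
            records.append(raw.rstrip())
    counts = Counter()
    for record in records:
        m = FUNC.search(record)
        if m:
            counts[m.group(1)] += 1
    return counts


def missing_entries(log_text, expected):
    """Return the expected function names that never appear in the log."""
    counts = count_acm_calls(log_text)
    return sorted(name for name in expected if counts[name] == 0)
```

Running missing_entries() over the log with a list such as
["acm_tty_install", "acm_submit_read_urb"] would show at a glance whether
those messages were ever captured.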

> Is it only after reboot you've seen the device fail?

Yes (though sometimes after reboot of an unaffected kernel into an
affected one! i.e. sometimes the first boot into an affected kernel is
broken).

>  What if you
> physically disconnect and reconnect the device instead,

That makes it work.

> or simply
> repeatedly open and close it?

IIRC -- but I'll have to check this tomorrow when I look at the log
problem, so don't take it as gospel -- that doesn't affect it: I can
stop and restart the daemon repeatedly and, if it wasn't working before,
it's not working afterwards: if it was working before, it'll be working
afterwards.
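
The open-and-close experiment can also be run without the daemon. This is
a minimal sketch under assumptions, not what ekeyd does: the function name
is invented, and it simply opens the tty, reads the terminal attributes,
writes them back with tcsetattr(), and closes again. Judging by the debug
lines in this thread, each open and close should make the driver send a
control-line request (rq 0x22) to the device.

```python
import os
import termios


def open_close_cycle(path, cycles=10):
    """Open and close a tty repeatedly.

    Returns the number of cycles in which tcgetattr()/tcsetattr() succeeded.
    """
    succeeded = 0
    for _ in range(cycles):
        # O_NOCTTY: do not make this our controlling terminal.
        # O_NONBLOCK: do not block in open() waiting for carrier.
        fd = os.open(path, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
        try:
            attrs = termios.tcgetattr(fd)
            termios.tcsetattr(fd, termios.TCSANOW, attrs)
            succeeded += 1
        except termios.error:
            pass  # count the cycle as failed and carry on
        finally:
            os.close(fd)
    return succeeded
```

Calling open_close_cycle("/dev/ttyACM0") with the daemon stopped and
cdc-acm debugging enabled would show in the kernel log whether the control
requests start failing partway through.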

-- 
NULL && (void)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-22 Thread Nix

On 22 Oct 2014, Johan Hovold outgrape:

> On Wed, Oct 22, 2014 at 10:31:17AM +0100, Nix wrote:
>> On 14 Oct 2014, Johan Hovold verbalised:
>> 
>> > On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> >> I have checked: this code is being executed against a symlink that
>> >> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> >> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> >> with this commit reverted...)
>> >
>> > You could verify that by enabling debugging in the cdc-acm driver and
>> > making sure that the corresponding control messages are indeed sent on
>> > close.
>> 
>> I have a debugging dump at
>> <http://www.esperi.org.uk/~nix/temporary/cdc-acm.log>; it's fairly
>
> What kernel were you using here? The log seems to suggest that it was
> generated with the commit in question reverted.

Look now :) (the previous log is still there in cdc-acm-reverted.log.)

This contains two boots -- one on which the USB key worked, and the next
(with an identical kernel) with which the key was broken. (I'm not sure
whether this problem happens at startup or shutdown time, so it seemed
best to provide both.)

-- 
NULL && (void)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-22 Thread Nix

On 22 Oct 2014, Johan Hovold uttered the following:

> On Wed, Oct 22, 2014 at 10:31:17AM +0100, Nix wrote:
>> On 14 Oct 2014, Johan Hovold verbalised:
>> 
>> > On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> >> I have checked: this code is being executed against a symlink that
>> >> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> >> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> >> with this commit reverted...)
>> >
>> > You could verify that by enabling debugging in the cdc-acm driver and
>> > making sure that the corresponding control messages are indeed sent on
>> > close.
>> 
>> I have a debugging dump at
>> <http://www.esperi.org.uk/~nix/temporary/cdc-acm.log>; it's fairly
>
> What kernel were you using here? The log seems to suggest that it was
> generated with the commit in question reverted.

Wurgle. I think I was probably using the wrong one, as you suggest.
Hell. That's no use at all is it. Still, at least you know what it looks
like when it works. :)

I'll retry with the right one.

> What kernel version are you using? And do you have autosuspend enabled?

3.16.6 (but it's happened for 3.16.* obviously). No autosuspend (no
*way* can this box suspend, but it only draws a couple of watts so I
don't care).

-- 
NULL && (void)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-22 Thread Nix

On 14 Oct 2014, Johan Hovold verbalised:

> On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> I have checked: this code is being executed against a symlink that
>> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> with this commit reverted...)
>
> You could verify that by enabling debugging in the cdc-acm driver and
> making sure that the corresponding control messages are indeed sent on
> close.

I have a debugging dump at
<http://www.esperi.org.uk/~nix/temporary/cdc-acm.log>; it's fairly
voluminous because the ekeyd is constantly doing USB reads, but the end
says

Oct 22 10:19:13 fold kern debug: : [   88.423970] cdc_acm 2-1:1.0: acm_tty_close
Oct 22 10:19:13 fold kern debug: : [   88.424012] cdc_acm 2-1:1.0: acm_port_shutdown
Oct 22 10:19:13 fold kern debug: : [   88.440038] cdc_acm 2-1:1.0: acm_ctrl_msg - rq 0x22, val 0x0, len 0x0, result 0
Oct 22 10:19:13 fold kern debug: : [   88.440038] cdc_acm 2-1:1.0: acm_ctrl_irq - urb shutting down with status: -2
Oct 22 10:19:13 fold kern debug: : [   88.440038] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 0, len 0
Oct 22 10:19:13 fold kern debug: : [   88.440038] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.447588] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 1, len 0
Oct 22 10:19:13 fold kern debug: : [   88.447613] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.448575] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 2, len 0
Oct 22 10:19:13 fold kern debug: : [   88.448599] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.449576] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 3, len 0
Oct 22 10:19:13 fold kern debug: : [   88.449599] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.450578] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 4, len 0
Oct 22 10:19:13 fold kern debug: : [   88.450602] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.451573] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 5, len 0
Oct 22 10:19:13 fold kern debug: : [   88.451596] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.452574] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 6, len 0
Oct 22 10:19:13 fold kern debug: : [   88.452597] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.453567] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 7, len 0
Oct 22 10:19:13 fold kern debug: : [   88.453588] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.454570] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 8, len 0
Oct 22 10:19:13 fold kern debug: : [   88.454592] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.462591] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 9, len 0
Oct 22 10:19:13 fold kern debug: : [   88.462619] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.463568] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 10, len 0
Oct 22 10:19:13 fold kern debug: : [   88.463590] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.464564] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 11, len 0
Oct 22 10:19:13 fold kern debug: : [   88.464585] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.465578] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 12, len 0
Oct 22 10:19:13 fold kern debug: : [   88.465602] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.466566] cdc_acm 2-1:1.1: acm_read_bulk_callback - urb 13, len 0
Oct 22 10:19:13 fold kern debug: : [   88.466587] cdc_acm 2-1:1.1: acm_read_bulk_callback - non-zero urb status: -2

which looks, hm, a bit suspicious to me.

> But you haven't seen any fw crashes since you reverted the commit in
> question?

Not a one.

> Another thing you could try is to add back the 
>
>   acm_set_control(acm, 0);
>
> just after the dev_info message in probe.

Tried that (with, obviously, the commit not reverted) -- rebooted, and

BytesRead=0
BytesWritten=0
ConnectionNonces=0
ConnectionPackets=0
ConnectionRekeys=0
ConnectionResets=0
ConnectionTime=46
EntropyRate=0
FipsFrameRate=0
FrameByteLast=0
FramesOk=0
FramingErrors=0
KeyDbsdShannonPerByteL=0
KeyDbsdShannonPerByteR=0
KeyEnglishBadness=No failure
KeyRawBadness=0
KeyRawShannonPerByteL=0
KeyRawShannonPerByteR=0
KeyRawShannonPerByteX=0
KeyShortBadness=efm_ok
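
A key=value dump like the one above is easy to check by script. The
following is a rough sketch under stated assumptions, with hypothetical
helper names: it treats the key as dead when it has been connected for a
while (ConnectionTime above zero) yet BytesRead and FramesOk are both
still zero. That rule is a guess based on the output shown here, not a
documented ekeyd criterion.

```python
def parse_stats(text):
    """Parse KEY=VALUE lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip lines without an "="
            stats[key] = value
    return stats


def looks_dead(stats):
    """Heuristic: connected for some time, but nothing has been received."""
    connected = int(stats.get("ConnectionTime", "0")) > 0
    silent = all(int(stats.get(name, "0")) == 0
                 for name in ("BytesRead", "FramesOk"))
    return connected and silent
```

For the dump above (ConnectionTime=46, BytesRead=0, FramesOk=0),
looks_dead() returns True.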

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-22 Thread Nix

On 14 Oct 2014, Johan Hovold verbalised:

> On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> I have checked: this code is being executed against a symlink that
>> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> with this commit reverted...)
>
> You could verify that by enabling debugging in the cdc-acm driver and
> making sure that the corresponding control messages are indeed sent on
> close.

I have a debugging dump at
<http://www.esperi.org.uk/~nix/temporary/cdc-acm.log>; it's fairly
voluminous because the ekeyd is constantly doing USB reads, but the end
says

Oct 22 10:19:13 fold kern debug: : [   88.423970] cdc_acm 2-1:1.0: acm_tty_close
Oct 22 10:19:13 fold kern debug: : [   88.424012] cdc_acm 2-1:1.0: 
acm_port_shutdown
Oct 22 10:19:13 fold kern debug: : [   88.440038] cdc_acm 2-1:1.0: acm_ctrl_msg 
- rq 0x22, val 0x0, len 0x0, result 0
Oct 22 10:19:13 fold kern debug: : [   88.440038] cdc_acm 2-1:1.0: acm_ctrl_irq 
- urb shutting down with status: -2
Oct 22 10:19:13 fold kern debug: : [   88.440038] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 0, len 0
Oct 22 10:19:13 fold kern debug: : [   88.440038] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.447588] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 1, len 0
Oct 22 10:19:13 fold kern debug: : [   88.447613] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.448575] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 2, len 0
Oct 22 10:19:13 fold kern debug: : [   88.448599] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.449576] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 3, len 0
Oct 22 10:19:13 fold kern debug: : [   88.449599] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.450578] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 4, len 0
Oct 22 10:19:13 fold kern debug: : [   88.450602] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.451573] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 5, len 0
Oct 22 10:19:13 fold kern debug: : [   88.451596] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.452574] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 6, len 0
Oct 22 10:19:13 fold kern debug: : [   88.452597] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.453567] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 7, len 0
Oct 22 10:19:13 fold kern debug: : [   88.453588] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.454570] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 8, len 0
Oct 22 10:19:13 fold kern debug: : [   88.454592] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.462591] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 9, len 0
Oct 22 10:19:13 fold kern debug: : [   88.462619] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.463568] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 10, len 0
Oct 22 10:19:13 fold kern debug: : [   88.463590] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.464564] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 11, len 0
Oct 22 10:19:13 fold kern debug: : [   88.464585] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.465578] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 12, len 0
Oct 22 10:19:13 fold kern debug: : [   88.465602] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2
Oct 22 10:19:13 fold kern debug: : [   88.466566] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - urb 13, len 0
Oct 22 10:19:13 fold kern debug: : [   88.466587] cdc_acm 2-1:1.1: 
acm_read_bulk_callback - non-zero urb status: -2

which looks, hm, a bit suspicious to me.

 But you haven't seen any fw crashes since you reverted the commit in
 question?

Not a one.

 Another thing you could try is to add back the 

   acm_set_control(acm, 0);

 just after the dev_info message in probe.

Tried that (with, obviously, the commit not reverted) -- rebooted, and

BytesRead=0
BytesWritten=0
ConnectionNonces=0
ConnectionPackets=0
ConnectionRekeys=0
ConnectionResets=0
ConnectionTime=46
EntropyRate=0
FipsFrameRate=0
FrameByteLast=0
FramesOk=0
FramingErrors=0
KeyDbsdShannonPerByteL=0
KeyDbsdShannonPerByteR=0
KeyEnglishBadness=No failure
KeyRawBadness=0
KeyRawShannonPerByteL=0
KeyRawShannonPerByteR=0
KeyRawShannonPerByteX=0
KeyShortBadness=efm_ok
KeyTemperatureC=-273.15
KeyTemperatureF=-459.67
KeyTemperatureK=0
KeyVoltage=0

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-22 Thread Nix

On 22 Oct 2014, Johan Hovold uttered the following:

> On Wed, Oct 22, 2014 at 10:31:17AM +0100, Nix wrote:
>> On 14 Oct 2014, Johan Hovold verbalised:
>> 
>> > On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> >> I have checked: this code is being executed against a symlink that
>> >> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> >> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> >> with this commit reverted...)
>> >
>> > You could verify that by enabling debugging in the cdc-acm driver and
>> > making sure that the corresponding control messages are indeed sent on
>> > close.
>> 
>> I have a debugging dump at
>> http://www.esperi.org.uk/~nix/temporary/cdc-acm.log; it's fairly
>
> What kernel were you using here? The log seems to suggest that it was
> generated with the commit in question reverted.

Wurgle. I think I was probably using the wrong one, as you suggest.
Hell. That's no use at all is it. Still, at least you know what it looks
like when it works. :)

I'll retry with the right one.

> What kernel version are you using? And do you have autosuspend enabled?

3.16.6 (but it's happened for 3.16.* obviously). No autosuspend (no
*way* can this box suspend, but it only draws a couple of watts so I
don't care).

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-22 Thread Nix

On 22 Oct 2014, Johan Hovold outgrape:

> On Wed, Oct 22, 2014 at 10:31:17AM +0100, Nix wrote:
>> On 14 Oct 2014, Johan Hovold verbalised:
>> 
>> > On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> >> I have checked: this code is being executed against a symlink that
>> >> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> >> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> >> with this commit reverted...)
>> >
>> > You could verify that by enabling debugging in the cdc-acm driver and
>> > making sure that the corresponding control messages are indeed sent on
>> > close.
>> 
>> I have a debugging dump at
>> http://www.esperi.org.uk/~nix/temporary/cdc-acm.log; it's fairly
>
> What kernel were you using here? The log seems to suggest that it was
> generated with the commit in question reverted.

Look now :) (the previous log is still there in cdc-acm-reverted.log.)

This contains two boots -- one on which the USB key worked, and the next
(with an identical kernel) with which the key was broken. (I'm not sure
whether this problem happens at startup or shutdown time, so it seemed
best to provide both.)

-- 
NULL && (void)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-17 Thread Nix

On 14 Oct 2014, Johan Hovold outgrape:
> Another thing you could try is to add back the 
>
>   acm_set_control(acm, 0);
>
> just after the dev_info message in probe.

"Add back" suggests that this line existed before this change. It
didn't, as far as I can see. Probing has

acm_set_control(acm, acm->ctrlout);

*shutdown* has

acm_set_control(acm, acm->ctrlout = 0);

Did you mean that I should try adding back the acm_set_control() during
shutdown, or the one during probe?

-- 
NULL && (void)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-14 Thread Nix

On 14 Oct 2014, Johan Hovold spake thusly:

> On Sun, Oct 12, 2014 at 10:36:30PM +0100, Nix wrote:
>> On 12 Oct 2014, Johan Hovold verbalised:
>> 
>> > On Sat, Oct 11, 2014 at 11:24:59PM +0100, Nix wrote:
>> >> On 11 Oct 2014, Paul Martin spake thusly:
>> >> 
>> >> > Having been privy to the firmware of the eKey, it is very simplistic,
>> >> > with no implementation whatsoever of any flow control.
>> >> 
>> >> That's what I thought. (Why would something that just provides data at a
>> >> constant rate way below that of even the slowest USB bus *need* flow
>> >> control?)
>> >> 
>> >> One presumes therefore that the kernel suddenly trying to do flow
>> >> control on shutdown would fubar the firmware's internal state, leading
>> >> to the symptoms I see.
>> >
>> > The cdc-acm driver was dropping DTR/RTS on shutdown (close) also before
>> > the commit you refer to. One thing it did change however is that this is
>> > now only done if HUPCL is set. Might setting that flag be enough to
>> > prevent the device firmware from crashing?
>> 
>> If I read the ekeyd 1.1.5 source code correctly, this is already
>> happening:
>> 
>> ,[ host/stream.c:estream_open() ]
>> | } else if (S_ISCHR(sbuf.st_mode)) {
>> |     /* Open the file as a character device/tty */
>> |     fd = open(uri, O_RDWR | O_NOCTTY);
>> |     if ((fd != -1) && (isatty(fd))) {
>> |         if (tcgetattr(fd, &settings) == 0 ) {
>> |             settings.c_cflag &= ~(CSIZE | CSTOPB | PARENB | CLOCAL |
>> |                                   CREAD | PARODD | CRTSCTS);
>> |             settings.c_iflag &= ~(BRKINT | IGNPAR | PARMRK | INPCK |
>> |                                   ISTRIP | INLCR | IGNCR | ICRNL | IXON |
>> |                                   IXOFF  | IXANY | IMAXBEL);
>> |             settings.c_iflag |= IGNBRK;
>> |             settings.c_oflag &= ~(OPOST | OCRNL | ONOCR | ONLRET);
>> |             settings.c_lflag &= ~(ISIG | ICANON | IEXTEN | ECHO |
>> |                                   ECHOE | ECHOK | ECHONL | NOFLSH |
>> |                                   TOSTOP | ECHOPRT | ECHOCTL | ECHOKE);
>> |             settings.c_cflag |= CS8 | HUPCL | CREAD | CLOCAL;
>> | #ifdef EKEY_FULL_TERMIOS
>> |             settings.c_cflag &= ~(CBAUD);
>> |             settings.c_iflag &= ~(IUTF8 | IUCLC);
>> |             settings.c_oflag &= ~(OFILL | OFDEL | NLDLY | CRDLY | TABDLY |
>> |                                   BSDLY | VTDLY | FFDLY | OLCUC );
>> |             settings.c_oflag |= NL0 | CR0 | TAB0 | BS0 | VT0 | FF0;
>> |             settings.c_lflag &= ~(XCASE);
>> | #endif
>> |             settings.c_cflag |= B115200;
>> |             if (tcsetattr(fd, TCSANOW, &settings) < 0) {
>> `
>> 
>> Note the HUPCL in there.
>> 
>> I have checked: this code is being executed against a symlink that
>> points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
>> succeeding on the kernel I'm running now, but of course that's 3.16.5
>> with this commit reverted...)
>
> You could verify that by enabling debugging in the cdc-acm driver and
> making sure that the corresponding control messages are indeed sent on
> close.

Yeah, good idea -- in the specific context of the system rebooting, in
particular (though I'll also see if just killing the daemon causes this
problem, something I haven't tested). I was assuming it would be hard to
see it before the reboot process cleared the screen -- but of course
this machine has a serial console so that's not important.

> But you haven't seen any fw crashes since you reverted the commit in
> question?

Not one.

> Another thing you could try is to add back the 
>
>   acm_set_control(acm, 0);
>
> just after the dev_info message in probe.

I'll try that tonight.

-- 
NULL && (void)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-12 Thread Nix

On 12 Oct 2014, Johan Hovold verbalised:

> On Sat, Oct 11, 2014 at 11:24:59PM +0100, Nix wrote:
>> On 11 Oct 2014, Paul Martin spake thusly:
>> 
>> > Having been privy to the firmware of the eKey, it is very simplistic,
>> > with no implementation whatsoever of any flow control.
>> 
>> That's what I thought. (Why would something that just provides data at a
>> constant rate way below that of even the slowest USB bus *need* flow
>> control?)
>> 
>> One presumes therefore that the kernel suddenly trying to do flow
>> control on shutdown would fubar the firmware's internal state, leading
>> to the symptoms I see.
>
> The cdc-acm driver was dropping DTR/RTS on shutdown (close) also before
> the commit you refer to. One thing it did change however is that this is
> now only done if HUPCL is set. Might setting that flag be enough to
> prevent the device firmware from crashing?

If I read the ekeyd 1.1.5 source code correctly, this is already
happening:

,[ host/stream.c:estream_open() ]
| } else if (S_ISCHR(sbuf.st_mode)) {
|     /* Open the file as a character device/tty */
|     fd = open(uri, O_RDWR | O_NOCTTY);
|     if ((fd != -1) && (isatty(fd))) {
|         if (tcgetattr(fd, &settings) == 0 ) {
|             settings.c_cflag &= ~(CSIZE | CSTOPB | PARENB | CLOCAL |
|                                   CREAD | PARODD | CRTSCTS);
|             settings.c_iflag &= ~(BRKINT | IGNPAR | PARMRK | INPCK |
|                                   ISTRIP | INLCR | IGNCR | ICRNL | IXON |
|                                   IXOFF  | IXANY | IMAXBEL);
|             settings.c_iflag |= IGNBRK;
|             settings.c_oflag &= ~(OPOST | OCRNL | ONOCR | ONLRET);
|             settings.c_lflag &= ~(ISIG | ICANON | IEXTEN | ECHO |
|                                   ECHOE | ECHOK | ECHONL | NOFLSH |
|                                   TOSTOP | ECHOPRT | ECHOCTL | ECHOKE);
|             settings.c_cflag |= CS8 | HUPCL | CREAD | CLOCAL;
| #ifdef EKEY_FULL_TERMIOS
|             settings.c_cflag &= ~(CBAUD);
|             settings.c_iflag &= ~(IUTF8 | IUCLC);
|             settings.c_oflag &= ~(OFILL | OFDEL | NLDLY | CRDLY | TABDLY |
|                                   BSDLY | VTDLY | FFDLY | OLCUC );
|             settings.c_oflag |= NL0 | CR0 | TAB0 | BS0 | VT0 | FF0;
|             settings.c_lflag &= ~(XCASE);
| #endif
|             settings.c_cflag |= B115200;
|             if (tcsetattr(fd, TCSANOW, &settings) < 0) {
`

Note the HUPCL in there.

I have checked: this code is being executed against a symlink that
points to /dev/ttyACM0, and the tcsetattr() succeeds. (At least, it's
succeeding on the kernel I'm running now, but of course that's 3.16.5
with this commit reverted...)

Of course the daemon is stopped before the reboot, closing the device.
(But even if it wasn't, one would assume that the fact the system was
rebooting would be considered tantamount to a close!)
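
For anyone who wants to confirm the flag independently of the daemon, here is
a minimal sketch (not taken from the ekeyd source; both helper names are made
up for illustration) of asking a tty whether HUPCL is currently set. It only
uses the standard termios calls already seen in the excerpt above:

```c
#include <stdbool.h>
#include <termios.h>
#include <unistd.h>

/* True if these settings request a hang-up (DTR/RTS drop) on last close. */
static bool termios_has_hupcl(const struct termios *t)
{
    return (t->c_cflag & HUPCL) != 0;
}

/*
 * Query an open file descriptor (hypothetical helper).
 * Returns 1 if HUPCL is set, 0 if it is clear, and -1 if fd is not a
 * tty or tcgetattr() fails.
 */
static int fd_has_hupcl(int fd)
{
    struct termios t;

    if (!isatty(fd) || tcgetattr(fd, &t) != 0)
        return -1;
    return termios_has_hupcl(&t) ? 1 : 0;
}
```

Calling fd_has_hupcl() on a descriptor opened on /dev/ttyACM0 after the
daemon has configured it should return 1 if the tcsetattr() above really took
effect; that is an expectation, not something this sketch has been run
against on the affected box.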

-- 
NULL && (void)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-11 Thread Nix

On 11 Oct 2014, Paul Martin spake thusly:

> Having been privy to the firmware of the eKey, it is very simplistic,
> with no implementation whatsoever of any flow control.

That's what I thought. (Why would something that just provides data at a
constant rate way below that of even the slowest USB bus *need* flow
control?)

One presumes therefore that the kernel suddenly trying to do flow
control on shutdown would fubar the firmware's internal state, leading
to the symptoms I see.

So, the question becomes, is there a way to spot this general 'no flow
control on this device' thing from the kernel side, or do we need a
blacklist? Or, perhaps, if this is commonplace for cdc-acm devices, a
whitelist? I can't imagine it's *that* commonplace or someone would have
spotted this already in the months and months it took me to do the
bisection.

Maybe all non-modem cdc-acm devices should eschew flow control, or
something? (This is a genuine guess and is almost certainly wrong.)

-- 
NULL && (void)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-11 Thread Nix

[Cc:ed someone who knows the people behind the Entropy Key: they're not
 being manufactured at the moment, but he might want to know anyway]

On 5 Sep 2014, n...@esperi.org.uk stated:

> On 1 Sep 2014, Oliver Neukum stated:
>
>>> I'll do a bisection of the cdc-acm changes since 3.15 tomorrow night and
>>> see if I can find the commit at fault.
>>
>> Thank you for the report. Please let me know the results of your
>> bisection.
>
> Bisection underway (fifth attempt -- I *may* have characterized it well
> enough after a few hours of thrashing at it to bisect accurately this
> time).
[...]
> More generally, the problem may be at *shutdown* -- something goes wrong
> during link suspension or something, such that the link never comes up
> again until physically reconnected. So a straight bisect is misleading
> -- the error may have been in the *last* kernel tested -- and even then,
> some kernels (e.g. the 3.15.0 merge base) appear capable of making it
> work fine. But even this is not consistent: sometimes a kernel that
> works fine if you repeatedly reboot it (such as 3.15) malfunctions when
> you reboot into 3.16 -- but sometimes a newly plugged USB key on a 3.16
> kernel malfunctions upon reboot, even if you reboot into a working
> kernel such as 3.15 (and it then proceeds to work indefinitely if you
> unplug and replug it and stick with 3.15.x, but upon rebooting into
> 3.16.x it goes wrong again).

*Finally* bisected, not helped by the fact that I sometimes needed up to
five reboots (!) to see a failure. The guilty commit is this one:

commit 0943d8ead30e9474034cc5e92225ab0fd29fd0d4
Author: Johan Hovold <jhov...@gmail.com>
Date:   Mon May 26 19:23:51 2014 +0200

    USB: cdc-acm: use tty-port dtr_rts

    Add dtr_rts tty-port operation which implements proper DTR/RTS handling
    (e.g. only lower DTR/RTS during shutdown if HUPCL is set).

    Note that modem-control locking still needs to be added throughout the
    driver.

    Signed-off-by: Johan Hovold <jhov...@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>

To re-describe this failure for the people who weren't in the thread: in
3.16.x I often see this output when asking the ekey daemon for the state
of my Simtec Entropy Key (a cdc-acm-based random number generator) after
rebooting my ohci-based Soekris net5501:

fold:~# ekeydctl stats 1
BytesRead=0
BytesWritten=0
ConnectionNonces=0
ConnectionPackets=0
ConnectionRekeys=0
ConnectionResets=0
ConnectionTime=65
EntropyRate=0
FipsFrameRate=0
FrameByteLast=0
FramesOk=0
FramingErrors=0
KeyDbsdShannonPerByteL=0
KeyDbsdShannonPerByteR=0
KeyEnglishBadness=No failure
KeyRawBadness=0
KeyRawShannonPerByteL=0
KeyRawShannonPerByteR=0
KeyRawShannonPerByteX=0
KeyShortBadness=efm_ok
KeyTemperatureC=-273.15
KeyTemperatureF=-459.67
KeyTemperatureK=0
KeyVoltage=0
PacketErrors=0
PacketOK=0
ReadRate=0
TotalEntropy=0
WriteRate=0

This device streams data continuously at a rate of several KiB/s, so
normally we would never expect to see a report of zero bytes read or
written if the key were functional (nor, indeed, a key temperature of
absolute zero!). This failure never occurred in 3.15.x nor any earlier
kernel. (Note: the 'no failure' message above is sent *from the key* to
indicate that the random numbers can be trusted: it is a bit unfortunate
that the code for 'No failure' is 0, which is also the default value
before anything is received from the key. In this case, we're just
seeing the daemon's initialization-time default. As BytesRead indicates,
the key is not talking to us.)
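
Since the 'No failure' field cannot be trusted on its own, a check on the byte
counters is a more reliable way to spot a silent key. The following is only a
sketch under assumptions: it treats the stats output as plain key=value lines
exactly as printed above, and the helper names (stats_lookup,
key_looks_silent) are hypothetical, not part of ekeyd:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find "key=" at the start of a line in `stats` and parse the integer
 * that follows. Returns true and fills *out on success.
 */
static bool stats_lookup(const char *stats, const char *key, long *out)
{
    size_t klen = strlen(key);
    const char *line = stats;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == '=') {
            char *end;
            long v = strtol(line + klen + 1, &end, 10);

            if (end == line + klen + 1)
                return false;   /* no digits after '=' */
            *out = v;
            return true;
        }
        line = strchr(line, '\n');
        if (line)
            line++;             /* step past the newline */
    }
    return false;
}

/* A key that has been connected for a while but moved no bytes is silent. */
static bool key_looks_silent(const char *stats)
{
    long rd, wr, up;

    if (!stats_lookup(stats, "BytesRead", &rd) ||
        !stats_lookup(stats, "BytesWritten", &wr) ||
        !stats_lookup(stats, "ConnectionTime", &up))
        return false;           /* incomplete output: cannot tell */
    return up > 0 && rd == 0 && wr == 0;
}
```

Fed the output above (ConnectionTime=65 with both byte counters at zero),
key_looks_silent() reports the key as silent regardless of what the badness
fields say.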

The symptoms are such that it is the kernel you reboot *from* that
causes the failure, not the one you reboot into: once the key fails it
never recovers without physical removal and reinsertion (or, one
presumes, a poweroff of the whole machine, but I haven't tried that).

This is not a consistent failure: sometimes it can take up to four
reboots for the key to fail. As a result, the bisection took forever (I
had to wait until I had a spare weekend day to devote to it). Despite
the erratic nature, I'm fairly confident this commit is at fault: with
it reverted, I have restarted a couple of dozen times without failure
symptoms.

(I speculate that the device's firmware may be terminally confused by
having something try to hang it up, since it's not a modem nor anything
like one, as the boot messages correctly proclaim. The firmware isn't
open, so I can't check.)

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-11 Thread Nix

[Cc:ed someone who knows the people behind the Entropy Key: they're not
 being manufactured at the moment, but he might want to know anyway]

On 5 Sep 2014, n...@esperi.org.uk stated:

 On 1 Sep 2014, Oliver Neukum stated:

 I'll do a bisection of the cdc-acm changes since 3.15 tomorrow night and
 see if I can find the commit at fault.

 Thank you for the report. Please let me know the results of your
 bisection.

 Bisection underway (fifth attempt -- I *may* have characterized it well
 enough after a few hours of thrashing at it to bisect accurately this
 time).
[...]
 More generally, the problem may be at *shutdown* -- something goes wrong
 during link suspension or something, such that the link never comes up
 again until physically reconnected. So a straight bisect is misleading
 -- the error may have been in the *last* kernel tested -- and even then,
 some kernels (e.g. the 3.15.0 merge base) appear capable of making it
 work fine. But even this is not consistent: sometimes a kernel that
 works fine if you repeatedly reboot it (such as 3.15) malfunctions when
 you reboot into 3.16 -- but sometimes a newly plugged USB key on a 3.16
 kernel malfunctions upon reboot, even if you reboot into a working
 kernel such as 3.15 (and it then proceeds to work indefinitely if you
 unplug and replug it and stick with 3.15.x, but upon rebooting into
 3.16.x it goes wrong again).

*Finally* bisected, not helped by the fact that I sometimes needed up to
five reboots (!) to see a failure. The guilty commit is this one:

commit 0943d8ead30e9474034cc5e92225ab0fd29fd0d4
Author: Johan Hovold jhov...@gmail.com
Date:   Mon May 26 19:23:51 2014 +0200

USB: cdc-acm: use tty-port dtr_rts

Add dtr_rts tty-port operation which implements proper DTR/RTS handling
(e.g. only lower DTR/RTS during shutdown if HUPCL is set).

Note that modem-control locking still needs to be added throughout the
driver.

Signed-off-by: Johan Hovold jhov...@gmail.com
Signed-off-by: Greg Kroah-Hartman gre...@linuxfoundation.org

To re-describe this failure for the people who weren't in the thread: in
3.16.x I often see this output when asking the ekey daemon for the state
of my Simtec Entropy Key (a cdc-acm-based random number generator) after
rebooting my ohci-based Soekris net5501:

fold:~# ekeydctl stats 1
BytesRead=0
BytesWritten=0
ConnectionNonces=0
ConnectionPackets=0
ConnectionRekeys=0
ConnectionResets=0
ConnectionTime=65
EntropyRate=0
FipsFrameRate=0
FrameByteLast=0
FramesOk=0
FramingErrors=0
KeyDbsdShannonPerByteL=0
KeyDbsdShannonPerByteR=0
KeyEnglishBadness=No failure
KeyRawBadness=0
KeyRawShannonPerByteL=0
KeyRawShannonPerByteR=0
KeyRawShannonPerByteX=0
KeyShortBadness=efm_ok
KeyTemperatureC=-273.15
KeyTemperatureF=-459.67
KeyTemperatureK=0
KeyVoltage=0
PacketErrors=0
PacketOK=0
ReadRate=0
TotalEntropy=0
WriteRate=0

This device streams data continuously at at rate of several KiB/s, so
normally we would never expect to see a report of zero bytes read or
written if the key were functional (nor, indeed, a key temperature of
absolute zero!). This failure never occurred in 3.15.x nor any earlier
kernel. (Note: the 'no failure' message above is sent *from the key* to
indicate that the random numbers can be trusted: it is a bit unfortunate
that the code for 'No failure' is 0, which is also the default value
before anything is received from the key. In this case, we're just
seeing the daemon's initialization-time default. As BytesRead indicates,
the key is not talking to us.)

The symptoms are such that it is the kernel you reboot *from* that
causes the failure, not the one you reboot into: once the key fails it
never recovers without physical removal and reinsertion (or, one
presumes, a poweroff of the whole machine, but I haven't tried that)

This is not a consistent failure: sometimes it can take up to four
reboots for the key to fail. As a result, the bisection took forever (I
had to wait until I had a spare weekend day to devote to it). Despite
the errative nature, I'm fairly confident this commit is at fault: with
it reverted, I have restarted a couple of dozen times without failure
symptoms.
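The confidence argument can be put in rough numbers. The sketch below assumes that each reboot from a faulty kernel breaks the key independently with some fixed probability; both that model and the figures used (a 1-in-4 failure chance, 24 reboots) are illustrative assumptions, not measurements.

```python
def chance_of_clean_run(p_fail, reboots):
    """Probability that a faulty kernel survives `reboots` reboots with no
    failure, assuming each reboot fails independently with chance p_fail."""
    return (1.0 - p_fail) ** reboots


# Illustrative assumptions only: failure chance per reboot and reboot count.
P_FAIL = 0.25
REBOOTS = 24

print(f"{chance_of_clean_run(P_FAIL, REBOOTS):.4f}")
```

Under those assumptions a still-faulty kernel would pass two dozen reboots only about one time in a thousand, which is why a long clean run with the commit reverted is reasonable evidence even for an intermittent failure.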

(I speculate that the device's firmware may be terminally confused by
having something try to hang it up, since it's not a modem nor anything
like one, as the boot messages correctly proclaim. The firmware isn't
open, so I can't check.)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [3.16.1 BISECTED REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-10-11 Thread Nix

On 11 Oct 2014, Paul Martin spake thusly:

> Having been privy to the firmware of the eKey, it is very simplistic,
> with no implementation whatsoever of any flow control.

That's what I thought. (Why would something that just provides data at a
constant rate way below that of even the slowest USB bus *need* flow
control?)

One presumes therefore that the kernel suddenly trying to do flow
control on shutdown would fubar the firmware's internal state, leading
to the symptoms I see.

So, the question becomes, is there a way to spot this general 'no flow
control on this device' thing from the kernel side, or do we need a
blacklist? Or, perhaps, if this is commonplace for cdc-acm devices, a
whitelist? I can't imagine it's *that* commonplace or someone would have
spotted this already in the months and months it took me to do the
bisection.

Maybe all non-modem cdc-acm devices should eschew flow control, or
something? (This is a genuine guess and is almost certainly wrong.)
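To make the blacklist idea concrete, here is a userspace sketch of a per-device quirk lookup. It is not kernel code and does not reflect the cdc-acm driver's actual quirk mechanism: the flag name, the table, and the vendor/product IDs are all hypothetical placeholders.

```python
# Hypothetical quirk flag: device firmware implements no flow control.
NO_FLOW_CONTROL = 0x1

# Hypothetical table keyed by (vendor_id, product_id). The IDs below are
# placeholders, not the real IDs of any device discussed in this thread.
QUIRKS = {
    (0x1234, 0x5678): NO_FLOW_CONTROL,
}


def quirks_for(vendor_id, product_id):
    """Return the quirk bits for a device, or 0 if it is not listed."""
    return QUIRKS.get((vendor_id, product_id), 0)


def may_touch_control_lines(vendor_id, product_id):
    """A blacklist policy: leave listed devices alone, handle the rest as usual."""
    return not (quirks_for(vendor_id, product_id) & NO_FLOW_CONTROL)


print(may_touch_control_lines(0x1234, 0x5678))
print(may_touch_control_lines(0xabcd, 0x0001))
```

A whitelist would simply invert the default: devices would be left alone unless they appeared in the table.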

-- 
NULL && (void)

Re: [3.16.1 REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-09-08 Thread Nix

On 8 Sep 2014, Oliver Neukum said:

> On Fri, 2014-09-05 at 16:17 +0100, Nix wrote:
>> On 5 Sep 2014, Oliver Neukum verbalised:
>> 
>> > On Fri, 2014-09-05 at 00:40 +0100, Nix wrote:
>> >> I'm working around this confusing morass by rebooting into each test
>> >> kernel, unplugging and replugging the entropy key if it was fubared,
>> >> then rebooting into the same kernel again and seeing if it was still
>> >> fubared. But this is not terribly fast, particularly not on a headless
>> >> compact-flash-based Geode box which doesn't even complete booting
>> >> without the entropy source which this bug cuts off :) so it'll be
>> >> sometime tomorrow before I can get this bisection done, I'm afraid.
>> >
>> > Ugh. My sympathies. I cannot suggest a better method, I am afraid.
>> 
>> Well, that method doesn't work. I've found pairs of kernels (e.g.
>> 59a3d4c3631e553357b7305dc09db1990aa6757c and
>> b05d59dfceaea72565b1648af929b037b0f96d7f) where each kernel works on its
>> own (rebooting from that kernel into the same kernel keeps a working
>> key, so I would normally assume that each kernel is OK) but rebooting
>> from the first into the second yields a broken one if it was working
>> before (so one of them must, in fact, be broken, but I have no clue
>> which one).
>> 
>> So I can't figure out how to bisect this.
>> 
>> Any suggestions as to what failure-test I might use, or what other
>> methods I might use to figure out what's going wrong? Not knowing
>> anything about USB doesn't help here. I don't know for sure that this is
>> a cdc-acm problem -- bisecting just the cdc-acm driver was fruitless --
>> so it might be something more generally USBish.
>
> Do your kernels work if you start with a known good kernel e.g.
> 3.15 and then reboot?

That case works -- aha, so I could orchestrate it by going

3.15 -> test -> test -> reset to 3.15 -> test -> test ...

i.e. a triple reboot cycle. Should have thought of that.

God, what a pain :)

I'll give it a try tonight.
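The reboot cycle above can be written down as a boot plan. This is only a sketch of the ordering: the kernel labels are placeholders, and actually rebooting and checking the key after the last boot of each group is left to whoever runs the bisection.

```python
def boot_plan(candidates, good="3.15"):
    """For each candidate kernel: boot the known-good kernel, then boot the
    candidate twice, so the second boot is a reboot *from* the candidate."""
    plan = []
    for kernel in candidates:
        plan += [good, kernel, kernel]
    return plan


print(" -> ".join(boot_plan(["test-a", "test-b"])))
```

Starting every group from the known-good kernel resets the key to a working state, so a failure seen after the second boot of a candidate can be pinned on that candidate rather than on whatever was booted before it.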

-- 
NULL && (void)

Re: Looks like a broken hub? (was Re: 3.16.2: 2TiB Seagate Expansion Desk apparently still broken with both USB mass storage and UAS: some debugging output)

2014-09-07 Thread Nix

[linux-scsi dropped, this is not a scsi or uas problem.]

On 7 Sep 2014, n...@esperi.org.uk stated:
> And... now it works, at least well enough to get a device file. So it's
> not the disk that's at fault: it's the no-name hub! (Which is, I think,
> USB ID 2109:0811 -- at least two instances of this disappear when I
> unplug the hub.)

Confirmed. Plugging a known-good (non-UAS) disk into the questionable
hub yields this mass of screaming:

Sep  7 23:19:29 mutilate info: : [  161.026517] usb 6-1.1.2: new SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:29 mutilate info: : [  161.041767] usb-storage 6-1.1.2:1.0: USB Mass Storage device detected
Sep  7 23:19:29 mutilate info: : [  161.043404] scsi8 : usb-storage 6-1.1.2:1.0
Sep  7 23:19:30 mutilate notice: : [  162.046725] scsi 8:0:0:0: Direct-Access     WD   My Book 1140 1012 PQ: 0 ANSI: 6
Sep  7 23:19:30 mutilate notice: : [  162.048476] scsi 8:0:0:1: Enclosure     WD   SES Device   1012 PQ: 0 ANSI: 6
Sep  7 23:19:30 mutilate notice: : [  162.056190] sd 8:0:0:0: [sdc] Spinning up disk...
Sep  7 23:19:30 fold warning: : [198061.014106] packet denied IN=bdsl OUT= MAC=00:00:24:cb:c6:a2:50:67:f0:8c:bf:8f:08:00 SRC=109.120.181.179 DST=81.187.191.133 LEN=28 TOS=0x00 PREC=0x00 TTL=249 ID=32820 PROTO=UDP SPT=54642 DPT=623 LEN=8
Sep  7 23:19:39 mutilate warning: : [  163.058746] .ready
Sep  7 23:19:39 mutilate notice: : [  171.085320] sd 8:0:0:0: [sdc] 3906963456 512-byte logical blocks: (2.00 TB/1.81 TiB)
Sep  7 23:19:39 mutilate notice: : [  171.087788] sd 8:0:0:0: [sdc] Write Protect is off
Sep  7 23:19:39 mutilate err: : [  171.090237] sd 8:0:0:0: [sdc] No Caching mode page found
Sep  7 23:19:39 mutilate err: : [  171.091720] sd 8:0:0:0: [sdc] Assuming drive cache: write through
Sep  7 23:19:39 mutilate info: : [  171.113631]  sdc: sdc1
Sep  7 23:19:39 mutilate notice: : [  171.117370] sd 8:0:0:0: [sdc] Attached SCSI disk
Sep  7 23:19:40 mutilate info: : [  171.525994] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:40 mutilate warning: : [  171.538771] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb40
Sep  7 23:19:40 mutilate warning: : [  171.540317] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb88
Sep  7 23:19:40 mutilate info: : [  171.807110] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:40 mutilate warning: : [  171.819716] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb40
Sep  7 23:19:40 mutilate warning: : [  171.821277] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb88
Sep  7 23:19:40 mutilate info: : [  172.082789] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:40 mutilate warning: : [  172.095409] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb40
Sep  7 23:19:40 mutilate warning: : [  172.097043] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb88
Sep  7 23:19:50 fold warning: : [198080.980957] packet denied IN=bdsl OUT= MAC=00:00:24:cb:c6:a2:50:67:f0:8c:bf:8f:08:00 SRC=71.6.165.200 DST=81.187.191.133 LEN=40 TOS=0x00 PREC=0x00 TTL=111 ID=8287 PROTO=TCP SPT=7987 DPT=7071 WINDOW=60597 RES=0x00 SYN URGP=0
Sep  7 23:19:56 mutilate info: : [  188.333974] EXT4-fs (sdc1): mounted filesystem without journal. Opts: (null)
Sep  7 23:19:59 mutilate info: : [  190.584332] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:59 mutilate warning: : [  190.597220] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb40
Sep  7 23:19:59 mutilate warning: : [  190.598996] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb88
Sep  7 23:19:59 mutilate info: : [  190.602042] sd 8:0:0:0: [sdc] Unhandled error code
Sep  7 23:19:59 mutilate info: : [  190.603677] sd 8:0:0:0: [sdc]
Sep  7 23:19:59 mutilate warning: : [  190.605251] Result: hostbyte=0x07 driverbyte=0x00
Sep  7 23:19:59 mutilate info: : [  190.606861] sd 8:0:0:0: [sdc] CDB:
Sep  7 23:19:59 mutilate warning: : [  190.608456] cdb[0]=0x28: 28 00 27 00 0c 18 00 00 f0 00
Sep  7 23:19:59 mutilate err: : [  190.610124] end_request: I/O error, dev sdc, sector 654314520

So either the hub is shagged, or the new USB extension cable that is the
only way either of these drives can physically reach the hub is shagged.
You'd think they wouldn't make hubs so bad that USB mass storage didn't
work, so maybe it's the cable?
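One way to compare a good link with a bad one is to count the resets per USB device path in a log like the one above. This is a rough sketch: the regular expression is an assumption based only on the message format visible in this log, and the function name is made up for the example.

```python
import re
from collections import Counter

# Matches e.g. "usb 6-1.1.2: reset SuperSpeed" (format assumed from the log).
RESET_RE = re.compile(r"usb ([0-9.\-]+): reset (\w+)")


def count_resets(log_text):
    """Count 'reset' messages per USB device path."""
    return Counter(match.group(1) for match in RESET_RE.finditer(log_text))


sample = """\
[  161.026517] usb 6-1.1.2: new SuperSpeed USB device number 5 using xhci_hcd
[  171.525994] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
[  171.807110] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
"""
print(count_resets(sample))
```

Running the same count on a log taken with the disk plugged straight into the machine would show whether the resets follow the hub and cable or the disk.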

I'll see if the new drive actually works well enough to not get horribly
corrupted doing a backup tomorrow :) if it does, I guess Alexandre's bug
is kind of fixed after all, maybe? At least with my slightly different
drive.

-- 
NULL && (void)

Looks like a broken hub? (was Re: 3.16.2: 2TiB Seagate Expansion Desk apparently still broken with both USB mass storage and UAS: some debugging output)

2014-09-07 Thread Nix

On 7 Sep 2014, Alan Stern spake thusly:

> On Sun, 7 Sep 2014, Nix wrote:
>
>> I have a brand new Seagate Expansion Desk drive attached to my x86-64
>> desktop. (I also have a 4TiB model of the same drive, but I haven't even
>> unboxed it: there seems little point as long as the 2TiB version doesn't
>> work.) I am seeing apparently the same problem as Alexandre Oliva
>> reported in <https://bugzilla.kernel.org/show_bug.cgi?id=79511>. The
>> drive is USB ID 0bc2:3321, so probably a slightly different model than
>> Alexandre's 0bc2:3320, but similar enough to be broken it seems.
>> 
>> I've tried it with both the usb-storage driver on xhci and with UAS,
>> with verbose USB mass storage debugging turned on. Both fail with
>> different log messages: see below. I haven't tried Alexandre's quirk
>> yet. (I have USB debugging for the UAS case only, alas, because the
>> system helpfully autoloaded and used that module when I was trying to
>> replicate the usb-storage-only failure, sigh. Autoloading is annoying
>> sometimes! :) )
>
> ...
>
>> I'm happy to do whatever further debugging people may deem necessary:
>> this system has up-to-date backups and is easy to reboot and try new
>> kernels on, and the drive is brand new and completely empty so it's
>> pretty much impossible to mess up!
>
> Please post a usbmon trace showing what happens when the drive binds to 
> usb-storage.  To prevent extraneous data from cluttering the trace, you 
> should unplug as many of the other USB devices on the same bus as 
> possible before starting the trace.

Done. I dropped the no-name hub out of the equation by taking over my
existing backup device's USB link (which also meant I could figure out
the bus number, since it doesn't change while the system is running :) ).

And... now it works, at least well enough to get a device file. So it's
not the disk that's at fault: it's the no-name hub! (Which is, I think,
USB ID 2109:0811 -- at least two instances of this disappear when I
unplug the hub.)

Attached, usbmon output from a successful negotiation, and a failed one
via the hub.

I am more than slightly annoyed. This machine only *has* two physical
USB3 ports, both on the same bus, and they're almost totally
inaccessible round the back, because of course you'd want your fastest
ports to be hardest to reach. The whole point of this hub was to fix
that stupid case-design decision.

So... anyone got any suggestions for a USB-3 hub I might buy that
doesn't mess things up for devices plugged into it? Do I just need one
new enough that it supports UAS? (And... how on earth can I tell before
buying?)

Sigh. Hardware sucks. (But maybe you can quirk around this...)

88041c19a3c0 1246876544 C Ii:6:001:1 0:2048 1 = 04
88041c19a3c0 1246876553 S Ii:6:001:1 -115:2048 4 <
880405a47480 1246876588 S Ci:6:001:0 s a3 00  0002 0004 4 <
880405a47480 1246876613 C Ci:6:001:0 0 4 = 03020100
8804058ced80 1246876646 S Co:6:001:0 s 23 01 0010 0002  0
8804058ced80 124687 C Co:6:001:0 0 0
8803508b8600 1246876690 S Ci:6:001:0 s a3 00  0002 0004 4 <
8803508b8600 1246876712 C Ci:6:001:0 0 4 = 0302
880405a470c0 1246902105 S Ci:6:001:0 s a3 00  0002 0004 4 <
880405a470c0 1246902117 C Ci:6:001:0 0 4 = 0302
88009bacd900 1246928126 S Ci:6:001:0 s a3 00  0002 0004 4 <
88009bacd900 1246928143 C Ci:6:001:0 0 4 = 0302
8803508b83c0 1246954116 S Ci:6:001:0 s a3 00  0002 0004 4 <
8803508b83c0 1246954127 C Ci:6:001:0 0 4 = 0302
8803508b8480 1246980134 S Ci:6:001:0 s a3 00  0002 0004 4 <
8803508b8480 1246980146 C Ci:6:001:0 0 4 = 0302
8803508b8e40 1246980371 S Ci:6:001:0 s a3 00  0002 0004 4 <
8803508b8e40 1246980383 C Ci:6:001:0 0 4 = 0302
880405a479c0 1246980425 S Co:6:001:0 s 23 03 0004 0002  0
880405a479c0 1246980440 C Co:6:001:0 0 0
8804058cef00 1247031101 S Ci:6:001:0 s a3 00  0002 0004 4 <
8804058cef00 1247031121 C Ci:6:001:0 0 4 = 03021000
8803508b8e40 1247082123 S Co:6:001:0 s 23 01 0014 0002  0
8803508b8e40 1247082146 C Co:6:001:0 0 0
880405a47d80 1247082177 S Co:6:001:0 s 23 01 001d 0002  0
880405a47d80 1247082201 C Co:6:001:0 0 0
8804058cef00 1247082228 S Co:6:001:0 s 23 01 0019 0002  0
8804058cef00 1247082241 C Co:6:001:0 0 0
8803508b8480 1247082284 S Co:6:001:0 s 23 01 0010 0002  0
8803508b8480 1247082312 C Co:6:001:0 0 0
880405a47d80 1247082345 S Ci:6:001:0 s a3 00  0002 0004 4 <
880405a47d80 1247082356 C Ci:6:001:0 0 4 = 0302
8803508b8600 1247096118 S Ci:6:010:0 s 80 06 0100  0008 8 <
8803508b8600 1247096214 C Ci:6:010:0 0 8 = 12010003 0009
8800b85e1840 1247096248 S Ci:6:010:0 s 80 06 0100  0012 18 <
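8800b85e1840 1247096339 C Ci:6:010:0 0 18 = 12010003 0009 c20b2133 00010203 0101
8800b85e1840 1247096382 S Ci:6:010:0 s 80 06 0f00  0005 5 <
8800b85e1840 1247096492 C Ci:6:010:0 0 5 = 050f1600 02
8800b85e1840 1247096534 S Ci:6:010:0 s 80 06 0f00  0016 22 <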
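For anyone wanting to pick the trace apart, the leading fields of each usbmon text line are the URB tag, a timestamp in microseconds, the event type (S for submission, C for callback, E for submission error) and an address word of the form type-and-direction:bus:device:endpoint. The sketch below splits out just those fields; the class and function names are made up for this example, and everything after the address word is left unparsed.

```python
from collections import namedtuple

UsbmonEvent = namedtuple(
    "UsbmonEvent", "tag timestamp_us event xfer bus device endpoint rest"
)


def parse_usbmon_line(line):
    """Parse the leading fields of one usbmon text line.

    Returns None for lines that are too short or malformed (for example a
    line cut off by the mail archive).
    """
    fields = line.split()
    if len(fields) < 4:
        return None
    tag, timestamp, event, address = fields[:4]
    parts = address.split(":")
    if len(parts) != 4 or not timestamp.isdigit():
        return None
    xfer, bus, device, endpoint = parts
    if not (bus.isdigit() and device.isdigit() and endpoint.isdigit()):
        return None
    return UsbmonEvent(
        tag, int(timestamp), event, xfer,
        int(bus), int(device), int(endpoint), fields[4:],
    )


event = parse_usbmon_line("880405a47480 1246876613 C Ci:6:001:0 0 4 = 03020100")
print(event.event, event.xfer, event.bus, event.device, event.endpoint)
```

Grouping the parsed events by device number makes it easier to separate root-hub traffic (device 001 here) from traffic to the drive itself.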

3.16.2: 2TiB Seagate Expansion Desk apparently still broken with both USB mass storage and UAS: some debugging output

2014-09-07 Thread Nix

I have a brand new Seagate Expansion Desk drive attached to my x86-64
desktop. (I also have a 4TiB model of the same drive, but I haven't even
unboxed it: there seems little point as long as the 2TiB version doesn't
work.) I am seeing apparently the same problem as Alexandre Oliva
reported in <https://bugzilla.kernel.org/show_bug.cgi?id=79511>. The
drive is USB ID 0bc2:3321, so probably a slightly different model than
Alexandre's 0bc2:3320, but similar enough to be broken it seems.

I've tried it with both the usb-storage driver on xhci and with UAS,
with verbose USB mass storage debugging turned on. Both fail with
different log messages: see below. I haven't tried Alexandre's quirk
yet. (I have USB debugging for the UAS case only, alas, because the
system helpfully autoloaded and used that module when I was trying to
replicate the usb-storage-only failure, sigh. Autoloading is annoying
sometimes! :) )

It's attached via an 8-port powered USB-3 hub marked only with the logo
'U-speed', possibly USB id 2109:0811.

Note: this system has strange boot messages at startup which may relate
to this no-name USB hub:

Sep  7 18:06:05 mutilate info: : [   33.960925] usb 6-2: reset SuperSpeed USB device number 3 using xhci_hcd
Sep  7 18:06:05 mutilate warning: : [   33.974576] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b418cc0
Sep  7 18:06:05 mutilate warning: : [   33.976995] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b418d08

While the system is running with mass storage debugging turned on, I see
a constant pulse of

Sep  7 17:45:27 mutilate warning: : [  152.048083]  00 00 00 00 00 00
Sep  7 17:45:27 mutilate warning: : [  152.049890] (unknown ASC/ASCQ)
Sep  7 17:45:27 mutilate warning: : [  152.051075]  00 00 00 00 00 00
Sep  7 17:45:27 mutilate warning: : [  152.053181] (unknown ASC/ASCQ)
Sep  7 17:45:27 mutilate warning: : [  152.054378]  00 00 00 00 00 00
Sep  7 17:45:27 mutilate warning: : [  152.056299] (unknown ASC/ASCQ)
Sep  7 17:45:27 mutilate warning: : [  152.057485]  00 00 00 00 00 00
Sep  7 17:45:27 mutilate warning: : [  152.059434] (unknown ASC/ASCQ)
Sep  7 17:45:28 mutilate warning: : [  152.559807]  4a 01 00 00 10 00 00 00 08 00

which may or may not relate to this device: it's definitely not directly
associated with the failing device though, since it's emitted even when
that device is off and unplugged. (I *do* have an always-mounted but
generally idle USB mass storage device here, a 2TiB WD My Book Essential
drive, USB ID 1058:1140: it may relate to that.)

Anyway: to the problem. When mounting the Seagate Expansion Desk drive
with usb-storage, I see:

Sep  6 10:47:39 mutilate info: : [474931.354699] usb 6-1.1.2: new SuperSpeed USB device number 9 using xhci_hcd
Sep  6 10:47:39 mutilate info: : [474931.367533] usb-storage 6-1.1.2:1.0: USB Mass Storage device detected
Sep  6 10:47:39 mutilate info: : [474931.367717] scsi13 : usb-storage 6-1.1.2:1.0
Sep  6 10:47:40 mutilate notice: : [474932.368828] scsi 13:0:0:0: Direct-Access Seagate  Expansion Desk   0604 PQ: 0 ANSI: 6
Sep  6 10:47:40 mutilate notice: : [474932.371900] sd 13:0:0:0: [sdh] Spinning up disk...
Sep  6 10:47:44 mutilate warning: : [474933.371782] ...
Sep  6 10:47:44 mutilate info: : [474935.805121] usb 6-1.1.2: reset SuperSpeed USB device number 9 using xhci_hcd
Sep  6 10:47:44 mutilate warning: : [474935.816264] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 8800c259ac00
Sep  6 10:47:44 mutilate warning: : [474935.816274] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 8800c259ac48
Sep  6 10:47:44 mutilate warning: : [474935.823611] ready
Sep  6 10:47:44 mutilate notice: : [474935.823835] sd 13:0:0:0: [sdh] 488378645 4096-byte logical blocks: (2.00 TB/1.81 TiB)
Sep  6 10:47:44 mutilate notice: : [474935.843934] sd 13:0:0:0: [sdh] Write Protect is off
Sep  6 10:47:44 mutilate notice: : [474935.844924] sd 13:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep  6 10:47:44 mutilate notice: : [474935.845268] sd 13:0:0:0: [sdh] 488378645 4096-byte logical blocks: (2.00 TB/1.81 TiB)
Sep  6 10:47:44 mutilate info: : [474936.190245] usb 6-1.1.2: reset SuperSpeed USB device number 9 using xhci_hcd
Sep  6 10:47:44 mutilate warning: : [474936.201858] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 8800c259ac00
Sep  6 10:47:44 mutilate warning: : [474936.201867] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 8800c259ac48
Sep  6 10:47:44 mutilate info: : [474936.312082] usb 6-1.1.2: reset SuperSpeed USB device number 9 using xhci_hcd
Sep  6 10:47:44 mutilate warning: : [474936.322885] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 8800c259ac00
Sep  6 10:47:44 mutilate warning: : [474936.322895] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep

Re: Looks like a broken hub? (was Re: 3.16.2: 2TiB Seagate Expansion Desk apparently still broken with both USB mass storage and UAS: some debugging output)

2014-09-07 Thread Nix

[linux-scsi dropped, this is not a scsi or uas problem.]

On 7 Sep 2014, n...@esperi.org.uk stated:
> And... now it works, at least well enough to get a device file. So it's
> not the disk that's at fault: it's the no-name hub! (Which is, I think,
> USB ID 2109:0811 -- at least two instances of this disappear when I
> unplug the hub.)

Confirmed. Plugging a known-good (non-UAS) disk into the questionable
hub yields this mass of screaming:

Sep  7 23:19:29 mutilate info: : [  161.026517] usb 6-1.1.2: new SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:29 mutilate info: : [  161.041767] usb-storage 6-1.1.2:1.0: USB Mass Storage device detected
Sep  7 23:19:29 mutilate info: : [  161.043404] scsi8 : usb-storage 6-1.1.2:1.0
Sep  7 23:19:30 mutilate notice: : [  162.046725] scsi 8:0:0:0: Direct-Access     WD   My Book 1140 1012 PQ: 0 ANSI: 6
Sep  7 23:19:30 mutilate notice: : [  162.048476] scsi 8:0:0:1: Enclosure     WD   SES Device   1012 PQ: 0 ANSI: 6
Sep  7 23:19:30 mutilate notice: : [  162.056190] sd 8:0:0:0: [sdc] Spinning up disk...
Sep  7 23:19:30 fold warning: : [198061.014106] packet denied IN=bdsl OUT= MAC=00:00:24:cb:c6:a2:50:67:f0:8c:bf:8f:08:00 SRC=109.120.181.179 DST=81.187.191.133 LEN=28 TOS=0x00 PREC=0x00 TTL=249 ID=32820 PROTO=UDP SPT=54642 DPT=623 LEN=8
Sep  7 23:19:39 mutilate warning: : [  163.058746] .ready
Sep  7 23:19:39 mutilate notice: : [  171.085320] sd 8:0:0:0: [sdc] 3906963456 512-byte logical blocks: (2.00 TB/1.81 TiB)
Sep  7 23:19:39 mutilate notice: : [  171.087788] sd 8:0:0:0: [sdc] Write Protect is off
Sep  7 23:19:39 mutilate err: : [  171.090237] sd 8:0:0:0: [sdc] No Caching mode page found
Sep  7 23:19:39 mutilate err: : [  171.091720] sd 8:0:0:0: [sdc] Assuming drive cache: write through
Sep  7 23:19:39 mutilate info: : [  171.113631]  sdc: sdc1
Sep  7 23:19:39 mutilate notice: : [  171.117370] sd 8:0:0:0: [sdc] Attached SCSI disk
Sep  7 23:19:40 mutilate info: : [  171.525994] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:40 mutilate warning: : [  171.538771] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb40
Sep  7 23:19:40 mutilate warning: : [  171.540317] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb88
Sep  7 23:19:40 mutilate info: : [  171.807110] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:40 mutilate warning: : [  171.819716] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb40
Sep  7 23:19:40 mutilate warning: : [  171.821277] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb88
Sep  7 23:19:40 mutilate info: : [  172.082789] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:40 mutilate warning: : [  172.095409] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb40
Sep  7 23:19:40 mutilate warning: : [  172.097043] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb88
Sep  7 23:19:50 fold warning: : [198080.980957] packet denied IN=bdsl OUT= MAC=00:00:24:cb:c6:a2:50:67:f0:8c:bf:8f:08:00 SRC=71.6.165.200 DST=81.187.191.133 LEN=40 TOS=0x00 PREC=0x00 TTL=111 ID=8287 PROTO=TCP SPT=7987 DPT=7071 WINDOW=60597 RES=0x00 SYN URGP=0
Sep  7 23:19:56 mutilate info: : [  188.333974] EXT4-fs (sdc1): mounted filesystem without journal. Opts: (null)
Sep  7 23:19:59 mutilate info: : [  190.584332] usb 6-1.1.2: reset SuperSpeed USB device number 5 using xhci_hcd
Sep  7 23:19:59 mutilate warning: : [  190.597220] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb40
Sep  7 23:19:59 mutilate warning: : [  190.598996] xhci_hcd :04:00.0: xHCI xhci_drop_endpoint called with disabled ep 88041b9efb88
Sep  7 23:19:59 mutilate info: : [  190.602042] sd 8:0:0:0: [sdc] Unhandled error code
Sep  7 23:19:59 mutilate info: : [  190.603677] sd 8:0:0:0: [sdc]
Sep  7 23:19:59 mutilate warning: : [  190.605251] Result: hostbyte=0x07 driverbyte=0x00
Sep  7 23:19:59 mutilate info: : [  190.606861] sd 8:0:0:0: [sdc] CDB:
Sep  7 23:19:59 mutilate warning: : [  190.608456] cdb[0]=0x28: 28 00 27 00 0c 18 00 00 f0 00
Sep  7 23:19:59 mutilate err: : [  190.610124] end_request: I/O error, dev sdc, sector 654314520

So either the hub is shagged, or the new USB extension cable that is the
only way either of these drives can physically reach the hub is shagged.
You'd think they wouldn't make hubs so bad that USB mass storage didn't
work, so maybe it's the cable?

I'll see if the new drive actually works well enough to not get horribly
corrupted doing a backup tomorrow :) if it does, I guess Alexandre's bug
is kind of fixed after all, maybe? At least with my slightly different
drive.

-- 
NULL && (void)

Re: [3.16.1 REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-09-05 Thread Nix

On 5 Sep 2014, Oliver Neukum verbalised:

> On Fri, 2014-09-05 at 00:40 +0100, Nix wrote:
>> I'm working around this confusing morass by rebooting into each test
>> kernel, unplugging and replugging the entropy key if it was fubared,
>> then rebooting into the same kernel again and seeing if it was still
>> fubared. But this is not terribly fast, particularly not on a headless
>> compact-flash-based Geode box which doesn't even complete booting
>> without the entropy source which this bug cuts off :) so it'll be
>> sometime tomorrow before I can get this bisection done, I'm afraid.
>
> Ugh. My sympathies. I cannot suggest a better method, I am afraid.

Well, that method doesn't work. I've found pairs of kernels (e.g.
59a3d4c3631e553357b7305dc09db1990aa6757c and
b05d59dfceaea72565b1648af929b037b0f96d7f) where each kernel works on its
own (rebooting from that kernel into the same kernel keeps a working
key, so I would normally assume that each kernel is OK) but rebooting
from the first into the second yields a broken one if it was working
before (so one of them must, in fact, be broken, but I have no clue
which one).

So I can't figure out how to bisect this.

Any suggestions as to what failure-test I might use, or what other
methods I might use to figure out what's going wrong? Not knowing
anything about USB doesn't help here. I don't know for sure that this is
a cdc-acm problem -- bisecting just the cdc-acm driver was fruitless --
so it might be something more generally USBish.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [3.16.1 REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-09-04 Thread Nix

On 1 Sep 2014, Oliver Neukum stated:

>
>> I'll do a bisection of the cdc-acm changes since 3.15 tomorrow night and
>> see if I can find the commit at fault.
>
> Thank you for the report. Please let me know the results of your
> bisection.

Bisection underway (fifth attempt -- I *may* have characterized it well
enough after a few hours of thrashing at it to bisect accurately this
time).

Some more random info.

btw, when the Entropy Key has ended up in a messed up state due to this
bug, we sometimes see

[2.330158] usb 2-1: new full-speed USB device number 2 using ohci-pci
[2.552465] usb 2-1: device descriptor read/64, error -62
[2.870142] usb 2-1: device descriptor read/64, error -62
[3.190150] usb 2-1: new full-speed USB device number 3 using ohci-pci
[3.410137] usb 2-1: device descriptor read/64, error -62
[3.740142] usb 2-1: device descriptor read/64, error -62
[4.060146] usb 2-1: new full-speed USB device number 4 using ohci-pci
[4.520133] usb 2-1: device not accepting address 4, error -62
[4.730139] usb 2-1: new full-speed USB device number 5 using ohci-pci
[5.180117] usb 2-1: device not accepting address 5, error -62
[5.215194] hub 2-0:1.0: unable to enumerate USB device on port 1

when starting up a working kernel (the key then doesn't work until
physically disconnected and reconnected again).
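
For anyone who wants to script the "is the key wedged?" check instead of eyeballing dmesg, here is a minimal sketch in Python. It assumes the kernel log lines look like the ones quoted above; the function name and the regular expression are illustrative and not taken from any existing tool. (On Linux, error -62 is -ETIME, "Timer expired".)

```python
import re

# Matches the two failure messages shown in the log excerpt above.
# The pattern is an assumption based only on those sample lines.
FAIL_RE = re.compile(
    r"usb (?P<port>[\d.-]+): "
    r"(?:device descriptor read/64|device not accepting address \d+), "
    r"error (?P<err>-\d+)"
)


def enumeration_failures(dmesg_text):
    """Return {port: [error codes]} for USB enumeration failures."""
    failures = {}
    for line in dmesg_text.splitlines():
        match = FAIL_RE.search(line)
        if match:
            port = match.group("port")
            failures.setdefault(port, []).append(int(match.group("err")))
    return failures


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
[2.330158] usb 2-1: new full-speed USB device number 2 using ohci-pci
[2.552465] usb 2-1: device descriptor read/64, error -62
[4.520133] usb 2-1: device not accepting address 4, error -62
[5.215194] hub 2-0:1.0: unable to enumerate USB device on port 1
"""
```

Feeding it the output of `dmesg` after each test boot would give a yes/no signal for that boot, though as described in this thread a single boot's result may reflect the previous kernel's shutdown and not the kernel under test.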

More generally, the problem may be at *shutdown* -- something goes wrong
during link suspension or something, such that the link never comes up
again until physically reconnected. So a straight bisect is misleading
-- the error may have been in the *last* kernel tested -- and even then,
some kernels (e.g. the 3.15.0 merge base) appear capable of making it
work fine. But even this is not consistent: sometimes a kernel that
works fine if you repeatedly reboot it (such as 3.15) malfunctions when
you reboot into 3.16 -- but sometimes a newly plugged USB key on a 3.16
kernel malfunctions upon reboot, even if you reboot into a working
kernel such as 3.15 (and it then proceeds to work indefinitely if you
unplug and replug it and stick with 3.15.x, but upon rebooting into
3.16.x it goes wrong again).

So sometimes a faulty kernel makes the key go wrong when you restart
into another kernel (faulty or not), and sometimes it makes a key go
wrong when it is restarted into. There doesn't seem to be any
consistency to this that I've spotted, at least not yet.

Upon physical reconnection, the USB key works again, even on afflicted
kernels.

I'm working around this confusing morass by rebooting into each test
kernel, unplugging and replugging the entropy key if it was fubared,
then rebooting into the same kernel again and seeing if it was still
fubared. But this is not terribly fast, particularly not on a headless
compact-flash-based Geode box which doesn't even complete booting
without the entropy source which this bug cuts off :) so it'll be
sometime tomorrow before I can get this bisection done, I'm afraid.

[3.16.1 REGRESSION]: Simtec Entropy Key (cdc-acm) broken in 3.16

2014-08-31 Thread Nix

So I upgraded to 3.16.1 and found that the Simtec Entropy Key (a cdc-acm
device) was no longer operational:

fold:~# ekeydctl stats 1
BytesRead=0
BytesWritten=0
ConnectionNonces=0
ConnectionPackets=0
ConnectionRekeys=0
ConnectionResets=0
ConnectionTime=65
EntropyRate=0
FipsFrameRate=0
FrameByteLast=0
FramesOk=0
FramingErrors=0
KeyDbsdShannonPerByteL=0
KeyDbsdShannonPerByteR=0
KeyEnglishBadness=No failure
KeyRawBadness=0
KeyRawShannonPerByteL=0
KeyRawShannonPerByteR=0
KeyRawShannonPerByteX=0
KeyShortBadness=efm_ok
KeyTemperatureC=-273.15
KeyTemperatureF=-459.67
KeyTemperatureK=0
KeyVoltage=0
PacketErrors=0
PacketOK=0
ReadRate=0
TotalEntropy=0
WriteRate=0

This device streams data continuously at a rate of several KiB/s, so
normally we would never expect to see a report of zero bytes read or
written if the key were functional (nor, indeed, a key temperature of
absolute zero!)
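
The same check can be automated. The following is a rough sketch in Python that parses the `KEY=VALUE` output shown above and flags a key that has been connected for some time without moving any data. The helper names and the "dead" heuristic are assumptions made for illustration; they are not part of the ekeyd tools.

```python
def parse_stats(text):
    """Parse KEY=VALUE lines (as printed by `ekeydctl stats`) into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' (e.g. a shell prompt)
            stats[key.strip()] = value.strip()
    return stats


def key_looks_dead(stats):
    """Heuristic (an assumption): connected, but nothing read or written."""
    return (
        float(stats.get("ConnectionTime", 0)) > 0
        and int(stats.get("BytesRead", 0)) == 0
        and int(stats.get("BytesWritten", 0)) == 0
    )


# A subset of the output quoted above.
SAMPLE = """\
BytesRead=0
BytesWritten=0
ConnectionTime=65
KeyEnglishBadness=No failure
KeyTemperatureK=0
"""
```

With the sample above, `key_looks_dead(parse_stats(SAMPLE))` is true; a healthy key would report non-zero byte counters after the same connection time.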

It appears that cdc-acm has broken such that no data is received from
this device any more (though it's still being detected, according to the
dmesg log). Something goes very askew with the entropy key -- even after
rebooting back to an earlier kernel, a physical disconnection and
reconnection of the entropy key is needed to make it work again. Whether
this is some sort of cdc-acm-level protocol problem, or a key-level
problem caused by interrupted communication, I have no clue.

3.15.8 works fine.

I'll do a bisection of the cdc-acm changes since 3.15 tomorrow night and
see if I can find the commit at fault.

-- 
NULL && (void)

Re: Unix-domain sockets hanging on 3.14.x (was Re: possible 3.14.1 lockd problem (?) causing nfsd hangs)

2014-04-28 Thread Nix

On 28 Apr 2014, Hannes Frederic Sowa uttered the following:

> On Mon, Apr 28, 2014 at 04:35:38PM +0100, Nix wrote:
>> /proc/$pid/stack of the two communicating ssh daemons was instructive:
>> 
>> [814e3512] unix_wait_for_peer+0x9f/0xbc
>> [814e5d48] unix_dgram_sendmsg+0x41b/0x534
>
> This one is a dgram socket...
>
>> [8146618f] sock_sendmsg+0x84/0x9e
>> [81467f3d] SyS_sendto+0x10e/0x13f
>> [815770e2] system_call_fastpath+0x16/0x1b
>> [] 0x
>> spindle:/var/log.real/by-facility# cat /proc/5941/stack
>> [814e493a] unix_stream_recvmsg+0x289/0x6d5
>
> ...and that's a stream receiver.
>
>> [814673e0] sock_aio_read.part.12+0xf0/0xff
>> [8146740b] sock_aio_read+0x1c/0x28
>> [811388fd] do_sync_read+0x59/0x78
>> [81138d8b] vfs_read+0xa2/0x13f
>> [81139699] SyS_read+0x47/0x8b
>> [81577259] tracesys+0xd0/0xd5
>> [] 0x
>
> Are you sure those are the communicating tasks?

Normally I'd say yes. This time I'd say probably not. I am very much not
at my best right now.

I'll wait for it to implode and try the same thing again. On past form I
won't have to wait very many days...

(this time, there was a six-hour interval between boot and
misbehaviour. Whatever the misbehaviour *is*.)

One more instance I didn't share because I only have one end of it:
starting up ISC dhcpd 4.2.4 also hung unexpectedly (which it did not
after rebooting):

[814e3512] unix_wait_for_peer+0x9f/0xbc
[814e5d48] unix_dgram_sendmsg+0x41b/0x534
[8146618f] sock_sendmsg+0x84/0x9e
[81467f3d] SyS_sendto+0x10e/0x13f
[815770e2] system_call_fastpath+0x16/0x1b
[] 0x

The common factor in these hangs is definitely unix_wait_for_peer. The
question is, what are they talking to? (In theory this could be a
userspace bug... but what common userspace factor is there between the
NFS server, OpenSSH, and ISC dhcpd? If it helps, named pipes may well be
suffering from the same malaise: /sbin/shutdown from sysvinit also hangs
and I have to /sbin/reboot -f instead. I'll check next time it hits, but
I'll bet it's trying to talk over /dev/initctl and getting nowhere.)
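
As background on how a sender can end up parked in a datagram send at all: on Linux, a Unix datagram socket that nobody is draining only accepts a limited number of queued messages, after which a blocking sender sleeps and a non-blocking one gets EAGAIN. The sketch below demonstrates that from userspace in Python. It is an illustration of the mechanism under that assumption, not a diagnosis of the hang reported here; the function name and the temporary socket path are made up for the example, and the behaviour shown is Linux-specific.

```python
import os
import socket
import tempfile


def datagrams_until_full(payload=b"x", limit=100000):
    """Count datagrams a non-blocking sender can queue to a receiver that
    never reads, before the kernel reports it would have to block."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "peer.sock")  # hypothetical socket path
        receiver = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        sender = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            receiver.bind(path)         # bound, but never calls recv()
            sender.setblocking(False)   # EAGAIN instead of sleeping
            sent = 0
            while sent < limit:
                try:
                    sender.sendto(payload, path)
                except BlockingIOError:
                    break               # a blocking sender would wait here
                sent += 1
            return sent
        finally:
            sender.close()
            receiver.close()
```

The count returned is small and depends on system settings, so none is quoted here. The point is only that a daemon which stops reading its datagram socket will eventually stall every blocking client that writes to it.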

Unix-domain sockets hanging on 3.14.x (was Re: possible 3.14.1 lockd problem (?) causing nfsd hangs)

2014-04-28 Thread Nix

On 25 Apr 2014, n...@esperi.org.uk verbalised:

> (This is extremely speculative: I don't really know how to debug
> problems like this, particularly not when the only machine I can
> reproduce the problem on is the headless one at the centre of my
> network, and with its NFS server stalled everything else blocks when
> accessing home directories, so my desktop hangs shortly afterwards.)

I have a bit more on this now: though still not really enough for a
diagnosis it might be enough for someone who knows the area to get an
aha! moment.

> A few days ago I upgraded my local net to 3.14.1. A little over two days
> later my desktop and firewall both hung, reporting this in dmesg:
>
> [65403.845099] xprt_adjust_timeout: rq_timeout = 0!
> [65403.845109] lockd: server spindle not responding, still trying
> [65403.845169] xprt_adjust_timeout: rq_timeout = 0!
> [65463.971767] xprt_adjust_timeout: rq_timeout = 0!
> [65463.971816] xprt_adjust_timeout: rq_timeout = 0!
> [65524.098447] xprt_adjust_timeout: rq_timeout = 0!
> [65524.098481] xprt_adjust_timeout: rq_timeout = 0!
> [65584.225157] xprt_adjust_timeout: rq_timeout = 0!
> [65644.351789] xprt_adjust_timeout: rq_timeout = 0!
> [65704.478477] xprt_adjust_timeout: rq_timeout = 0!
> [65722.388524] nfs: server spindle not responding, still trying
> [65735.917032] nfs: server spindle not responding, still trying
> [65735.920025] nfs: server spindle not responding, still trying
> [65752.227986] nfs: server spindle not responding, still trying
>
> I cursed, rebooted the server into 3.13.6, where it had been happily
> running for a month plus with 3.13.x clients, and resolved to look at it
> further this weekend.
>
> A few minutes ago (a little over two days since that last reboot) the
> whole thing hung again. This time the hang started with my home cinema,
> an XBMC-on-Android box which accesses NFS from userspace via libnfs.
> However, a little investigation showed that access to everything else
> from every machine on the network was also hanging, little by little and
> export by export: access to my home directory from my desktop had
> already hung and other things were hanging piece by piece (quite
> possibly an artifact of caching, it is possible that everything hung at
> once). ps showed that all the nfsd threads on the server were blocked in
> one of svc_get_next_xprt, cache_wait_req.isra.9, poll_schedule_timeout,
> or unix_wait_for_peer (though this last I'm fairly sure is normal!).

This is the smoking gun, it turns out: the problem is not with NFS at
all. It hit again, and all sorts of things failed, including new ssh
connections and even new connections over an ssh master: but ps on the
afflicted server revealed the ssh server stuck in [net] mode, when it
opens an AF_UNIX socket to talk to its nonprivileged child.

/proc/$pid/stack of the two communicating ssh daemons was instructive:

[814e3512] unix_wait_for_peer+0x9f/0xbc
[814e5d48] unix_dgram_sendmsg+0x41b/0x534
[8146618f] sock_sendmsg+0x84/0x9e
[81467f3d] SyS_sendto+0x10e/0x13f
[815770e2] system_call_fastpath+0x16/0x1b
[] 0x
spindle:/var/log.real/by-facility# cat /proc/5941/stack
[814e493a] unix_stream_recvmsg+0x289/0x6d5
[814673e0] sock_aio_read.part.12+0xf0/0xff
[8146740b] sock_aio_read+0x1c/0x28
[811388fd] do_sync_read+0x59/0x78
[81138d8b] vfs_read+0xa2/0x13f
[81139699] SyS_read+0x47/0x8b
[81577259] tracesys+0xd0/0xd5
[] 0x

So the underlying problem here is that Unix-domain sockets are hanging,
with each end waiting for the other indefinitely: this obviously causes
rapid NFS problems because of the upcalls done by the NFS server, and
once all the nfsd threads are stuck waiting for an upcall that never
completes, NFS goes down. This is system-wide: once one process is hit
by it, they all are. To me that screams a stuck lock somewhere, but I
don't know enough about the networking layer to know where that might
be.
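
To see how widespread the stall is when it next hits, every task's kernel stack can be swept for a given symbol in one go. Below is a minimal sketch in Python that does this by reading `/proc/<pid>/stack`. The function name is made up for the example, reading those files normally needs root, and the `proc_root` parameter exists only so the code can be tried against a scratch directory.

```python
import os


def tasks_blocked_in(symbol, proc_root="/proc"):
    """Return the sorted PIDs whose kernel stack mentions `symbol`."""
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stack")) as f:
                stack = f.read()
        except OSError:
            continue  # task exited, no stack file, or no permission
        if symbol in stack:
            hits.append(int(entry))
    return sorted(hits)
```

Running `tasks_blocked_in("unix_wait_for_peer")` as root during a hang would list every process stuck in the same place, which may help tell "everything is waiting on one daemon" apart from a lock stuck somewhere in the socket layer.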

Since I saw this hang again when I switched back to 3.13.x it is
just barely possible that this bug predates the 3.14 cycle, though it is
strange in that case that I never saw it in months of running 3.13.6
before the 3.14.x upgrade.

> This is clearly not a 3.14 server bug -- it's more likely a bug

Ah, it's nice to make mistakes in public like this :)

> triggered by something one of the 3.14 *clients* is doing. And one of
> the clients is doing something a little unusual: email delivery into
> home directories (using sendmail) via NFSv3, which means fairly heavy
> and bursty lockd use. I have no proof nor even evidence that this is
> implicated, but it makes my ears itch.

That would be the upcalls.

> This time I also got a hung task warning from one of the clients, though
> quite likely not the client that triggered the crash:
>
> [298275.185240] INFO: task tee:1312 blocked for more than 120 seconds.
> [298275.185252]   Not tainted 3.14.1-05371-gf2552aa-dirty #1
> [298275.185256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [298275.185259] tee D  0  1312   1309 
> 0x
>

Re: Unix-domain sockets hanging on 3.14.x (was Re: possible 3.14.1 lockd problem (?) causing nfsd hangs)

2014-04-28 Thread Nix

On 28 Apr 2014, Hannes Frederic Sowa uttered the following:

 On Mon, Apr 28, 2014 at 04:35:38PM +0100, Nix wrote:
 /proc/$pid/stack of the two communicating ssh daemons was instructive:
 
 [814e3512] unix_wait_for_peer+0x9f/0xbc
 [814e5d48] unix_dgram_sendmsg+0x41b/0x534

 This one is a dgram socket...

 [8146618f] sock_sendmsg+0x84/0x9e
 [81467f3d] SyS_sendto+0x10e/0x13f
 [815770e2] system_call_fastpath+0x16/0x1b
 [] 0x
 spindle:/var/log.real/by-facility# cat /proc/5941/stack
 [814e493a] unix_stream_recvmsg+0x289/0x6d5

 ...and that's a stream receiver.

 [814673e0] sock_aio_read.part.12+0xf0/0xff
 [8146740b] sock_aio_read+0x1c/0x28
 [811388fd] do_sync_read+0x59/0x78
 [81138d8b] vfs_read+0xa2/0x13f
 [81139699] SyS_read+0x47/0x8b
 [81577259] tracesys+0xd0/0xd5
 [] 0x

 Are you sure those are the communicating tasks?

Normally I'd say yes. This time I'd say probably not. I am very much not
at my best right now.

I'll wait for it to implode and try the same thing again. On past form I
won't have to wait very many days...

(this time, there was a six-hour interval between boot and
misbehaviour. Whatever the misbehaviour *is*.)

One more instance I didn't share because I only have one end of it:
starting up ISC dhcpd 4.2.4 also hung unexpectedly (which it did not
after rebooting):

[814e3512] unix_wait_for_peer+0x9f/0xbc
[814e5d48] unix_dgram_sendmsg+0x41b/0x534
[8146618f] sock_sendmsg+0x84/0x9e
[81467f3d] SyS_sendto+0x10e/0x13f
[815770e2] system_call_fastpath+0x16/0x1b
[] 0x

The common factor in these hangs is definitely unix_wait_for_peer. The
question is, what are they talking to? (In theory this could be a
userspace bug... but what common userspace factor is there between the
NFS server, OpenSSH, and ISC dhcpd? If it helps, named pipes may well be
suffering from the same malaise: /sbin/shutdown from sysvinit also hangs
and I have to /sbin/reboot -f instead. I'll check next time it hits, but
I'll bet it's trying to talk over /dev/initctl and getting nowhere.)
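
The unix_wait_for_peer frames are at least consistent with ordinary
back-pressure: as far as I can tell from net/unix/af_unix.c, a datagram
sender that is not the receiver's connected peer sleeps there when the
receiving socket's queue is full and nothing is draining it. The
following is a minimal user-space sketch of that condition, not a
reproducer for the hang. The function name and the abstract socket name
are made up for the demo, and the sender is non-blocking, so it gets
EAGAIN at the point where a blocking sender would go to sleep. Depending
on net.unix.max_dgram_qlen and the socket buffer sizes, the limit
reached may be the receive queue or the sender's own send buffer.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Send 1-byte datagrams to a bound receiver that never reads.
 * Returns how many were accepted before the kernel reported that the
 * sender would have to block, or -1 on any other failure.
 */
int dgrams_until_would_block(void)
{
	struct sockaddr_un addr;
	socklen_t len;
	int rx, tx, sent = 0;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	/* Abstract-namespace name (leading NUL byte): Linux-specific,
	 * leaves no file behind. The name itself is arbitrary. */
	snprintf(addr.sun_path + 1, sizeof(addr.sun_path) - 1,
		 "wait-for-peer-demo-%ld", (long)getpid());
	len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 +
			  strlen(addr.sun_path + 1));

	rx = socket(AF_UNIX, SOCK_DGRAM, 0);
	tx = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (rx < 0 || tx < 0 ||
	    bind(rx, (struct sockaddr *)&addr, len) < 0 ||
	    fcntl(tx, F_SETFL, O_NONBLOCK) < 0) {
		if (rx >= 0)
			close(rx);
		if (tx >= 0)
			close(tx);
		return -1;
	}

	/* Nobody reads from rx, so its queue can only grow.
	 * The upper bound is just a safety net for the demo. */
	while (sent < 100000) {
		if (sendto(tx, "x", 1, 0, (struct sockaddr *)&addr, len) < 0) {
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				sent = -1;
			break;
		}
		sent++;
	}

	close(tx);
	close(rx);
	return sent;
}
```

The returned count depends on the kernel's limits, so treat it as
indicative only. The point is that a receiver which stops reading is
enough to park every blocking sender in the kernel.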
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

possible 3.14.1 lockd problem (?) causing nfsd hangs

2014-04-25 Thread Nix

(This is extremely speculative: I don't really know how to debug
problems like this, particularly not when the only machine I can
reproduce the problem on is the headless one at the centre of my
network, and with its NFS server stalled everything else blocks when
accessing home directories, so my desktop hangs shortly afterwards.)

A few days ago I upgraded my local net to 3.14.1. A little over two days
later my desktop and firewall both hung, reporting this in dmesg:

[65403.845099] xprt_adjust_timeout: rq_timeout = 0!
[65403.845109] lockd: server spindle not responding, still trying
[65403.845169] xprt_adjust_timeout: rq_timeout = 0!
[65463.971767] xprt_adjust_timeout: rq_timeout = 0!
[65463.971816] xprt_adjust_timeout: rq_timeout = 0!
[65524.098447] xprt_adjust_timeout: rq_timeout = 0!
[65524.098481] xprt_adjust_timeout: rq_timeout = 0!
[65584.225157] xprt_adjust_timeout: rq_timeout = 0!
[65644.351789] xprt_adjust_timeout: rq_timeout = 0!
[65704.478477] xprt_adjust_timeout: rq_timeout = 0!
[65722.388524] nfs: server spindle not responding, still trying
[65735.917032] nfs: server spindle not responding, still trying
[65735.920025] nfs: server spindle not responding, still trying
[65752.227986] nfs: server spindle not responding, still trying

I cursed, rebooted the server into 3.13.6, where it had been happily
running for a month plus with 3.13.x clients, and resolved to look at it
further this weekend.

A few minutes ago (a little over two days since that last reboot) the
whole thing hung again. This time the hang started with my home cinema,
an XBMC-on-Android box which accesses NFS from userspace via libnfs.
However, a little investigation showed that access to everything else
from every machine on the network was also hanging, little by little and
export by export: access to my home directory from my desktop had
already hung and other things were hanging piece by piece (quite
possibly an artifact of caching: it is possible that everything hung at
once). ps showed that all the nfsd threads on the server were blocked in
one of svc_get_next_xprt, cache_wait_req.isra.9, poll_schedule_timeout,
or unix_wait_for_peer (though this last I'm fairly sure is normal!).

This is clearly not a 3.14 server bug -- it's more likely a bug
triggered by something one of the 3.14 *clients* is doing. And one of
the clients is doing something a little unusual: email delivery into
home directories (using sendmail) via NFSv3, which means fairly heavy
and bursty lockd use. I have no proof nor even evidence that this is
implicated, but it makes my ears itch.

This time I also got a hung task warning from one of the clients, though
quite likely not the client that triggered the crash:

[298275.185240] INFO: task tee:1312 blocked for more than 120 seconds.
[298275.185252]   Not tainted 3.14.1-05371-gf2552aa-dirty #1
[298275.185256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[298275.185259] tee D  0  1312   1309 0x
[298275.185269]  8804163f7b28 0082 8804163f7fd8 8804163d8000
[298275.185282]  000128c0 8804163d8000
[298275.185292]  0490 00010001 00020001
[298275.185302] Call Trace:
[298275.185314]  [81234564] ? nfs_free_request+0x94/0x94
[298275.185321]  [81691a4f] schedule+0x73/0x75
[298275.185325]  [81691c49] io_schedule+0x8f/0xd6
[298275.185331]  [81234572] nfs_wait_bit_uninterruptible+0xe/0x12
[298275.185336]  [81691f54] __wait_on_bit+0x48/0x7a
[298275.185341]  [81692001] out_of_line_wait_on_bit+0x7b/0x86
[298275.185349]  [81234564] ? nfs_free_request+0x94/0x94
[298275.185355]  [810be608] ? autoremove_wake_function+0x34/0x34
[298275.185361]  [8123487e] nfs_wait_on_request+0x2b/0x2d
[298275.185366]  [8123899e] nfs_updatepage+0x49b/0x522
[298275.185370]  [8122cb95] nfs_write_end+0xf8/0x269
[298275.185378]  [8112d70e] generic_file_buffered_write+0x173/0x23c
[298275.185384]  [8112e49b] __generic_file_aio_write+0x2a8/0x2e0
[298275.185390]  [8112e52b] generic_file_aio_write+0x58/0xc3
[298275.185395]  [8122d2d3] nfs_file_write+0xd1/0x14e
[298275.185402]  [811714ee] do_sync_write+0x59/0x78
[298275.185407]  [81171a76] vfs_write+0xc4/0x181
[298275.185412]  [811722ae] SyS_write+0x47/0x8b
[298275.185419]  [816952e2] system_call_fastpath+0x16/0x1b

However, I suspect this is merely saying that all the server's nfsd
threads were unresponsive, which, of course, we already knew.

Advice on how to debug this would be appreciated. It's not
*intermittent* as such, but waiting two days between crashes is
tiresome.

A few /proc/fs/nfs/exports lines for likely-implicated filesystems
follow (though they're nearly all the same, the thing is lazily
populated and the machine just rebooted so most haven't reappeared yet).
All exported filesystems are ext4.

/home/.spindle.srvr.nix 
fold.srvr.nix(rw,root_squash,async,wdelay,no_subtree_check,fsid=1,uuid=95bd22c2:253c456f:8e36b6cf:b9ecd4ef,sec=1)
/usr/src

Re: [PATCH] scsi disk: Use its own buffer for the vpd request

2013-08-31 Thread Nix

On 31 Aug 2013, Greg KH said:
> On Fri, Aug 30, 2013 at 11:01:56AM +0100, Nix wrote:
>> On 1 Aug 2013, Bernd Schubert said:
>> 
>> > Once I noticed that scsi_get_vpd_page() works fine from other function
>> > calls and that it is not 0x89, but already 0x0 that fails fixing it became
>> > easy.
>> >
>> > Nix, any chance you could verify it also works for you?
>> 
>> As an aside, this commit does indeed fix the bug I reported, but it
>> doesn't seem to have gone anywhere, not even into -stable.
>> 
>> Is it held up somehow?
>> 
>> (stable has
>> 
>> commit 0ac10bd036f0f3b8ce7ac2390446eab9531c72eb
>> Author: Martin K. Petersen 
>> Date:   Tue Jul 30 22:58:34 2013 -0400
>> 
>> SCSI: Don't attempt to send extended INQUIRY command if skip_vpd_pages 
>> is set
>> 
>> which IIRC was eventually found not to be necessary, because this fix
>> works fine instead?)
>> 
>> Possibly I'm misremembering the order of month-old events and Martin's
>> fix was eventually considered better... in which case, sorry for the noise.
>
> Is that other patch even needed anymore, now that Martin's patch is in
> the tree?

My understanding is that this patch is rather better, since Martin's
patch prevents sending of the extended INQUIRY command at all: this one
just uses a reduced buffer size, but can still issue the command. (But I
may be misunderstanding everything.)
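
To make the buffer-size distinction concrete: in the INQUIRY CDB, as I
understand the layout from SPC-3 onwards, the EVPD bit selects a VPD
page and bytes 3 and 4 carry the allocation length, so a smaller
allocation length still sends the command and merely caps how much the
device may return. Below is a minimal user-space sketch of building
such a CDB. The function name and the max_len parameter are
hypothetical and are not taken from either patch under discussion.

```c
#include <stddef.h>
#include <stdint.h>

#define INQUIRY_OPCODE 0x12

/*
 * Fill a 6-byte INQUIRY CDB asking for VPD page `page`.
 * The allocation length is the caller's buffer size, capped at
 * max_len (a hypothetical knob for this sketch) and at the 16-bit
 * field width. Returns the allocation length actually encoded.
 */
size_t build_vpd_inquiry(uint8_t cdb[6], uint8_t page,
			 size_t buflen, size_t max_len)
{
	size_t len = buflen < max_len ? buflen : max_len;

	if (len > 0xffff)
		len = 0xffff;

	cdb[0] = INQUIRY_OPCODE;
	cdb[1] = 0x01;                   /* EVPD: return a VPD page */
	cdb[2] = page;                   /* e.g. 0x89, ATA Information */
	cdb[3] = (uint8_t)(len >> 8);    /* allocation length, high byte */
	cdb[4] = (uint8_t)(len & 0xff);  /* allocation length, low byte */
	cdb[5] = 0;                      /* control */
	return len;
}
```

With a cap of 255 the high byte is always zero, which is the shape an
older device that only looks at byte 4 would expect.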

-- 
NULL && (void)

Re: [PATCH] scsi disk: Use its own buffer for the vpd request

2013-08-30 Thread Nix

On 1 Aug 2013, Bernd Schubert said:

> Once I noticed that scsi_get_vpd_page() works fine from other function
> calls and that it is not 0x89, but already 0x0 that fails fixing it became
> easy.
>
> Nix, any chance you could verify it also works for you?

As an aside, this commit does indeed fix the bug I reported, but it
doesn't seem to have gone anywhere, not even into -stable.

Is it held up somehow?

(stable has

commit 0ac10bd036f0f3b8ce7ac2390446eab9531c72eb
Author: Martin K. Petersen 
Date:   Tue Jul 30 22:58:34 2013 -0400

SCSI: Don't attempt to send extended INQUIRY command if skip_vpd_pages is 
set

which IIRC was eventually found not to be necessary, because this fix
works fine instead?)

Possibly I'm misremembering the order of month-old events and Martin's
fix was eventually considered better... in which case, sorry for the noise.

-- 
NULL && (void)

Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-07 Thread Nix

On 7 Aug 2013, Trond Myklebust said:

> On Wed, 2013-08-07 at 11:18 +0100, Nix wrote:
>> On 6 Aug 2013, Trond Myklebust verbalised:
>> > True. How about something like the following instead. Note the change to
>> > the original patch...
>> 
>> Well, with those applied I could reboot without a panic for the first
>> time since 3.8.x: looking good. I'll give it a reboot or two with a
>> system that's not hot from booting though.
>
> Could you please also try applying only the 1/2 patch, to see if that
> suffices to quell the shutdown panic?

It doesn't suffice. I see this severely truncated oops:

[  115.799092] BUG: unable to handle kernel NULL pointer dereference at 0008
[  115.800284] IP: [81165ec6] path_init+0x11c/0x36f
[  115.801463] PGD 0 
[  115.802625] Oops:  [#1] PREEMPT SMP 
[  115.803805] Modules linked in: [last unloaded: microcode] 
[  115.804995] CPU: 3 PID: 1191 Comm: sleep Not tainted 3.10.5-05317-g3c9f6fa-dirty #2
[  115.806207] Hardware name: System manufacturer System Product Name/P8H61-MX USB3, BIOS 0506 08/10/2012
[  115.807453] task: 8804189a ti: 8803f74d6000 task.ti: 8803f74d6000

-- 
NULL && (void)

Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-07 Thread Nix

On 6 Aug 2013, Trond Myklebust verbalised:
> True. How about something like the following instead. Note the change to
> the original patch...

Well, with those applied I could reboot without a panic for the first
time since 3.8.x: looking good. I'll give it a reboot or two with a
system that's not hot from booting though.

-- 
NULL && (void)

Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-06 Thread Nix

On 5 Aug 2013, Trond Myklebust uttered the following:
> Yes. This scheme will only work if we make sure that host->h_rpcclnt is
> initialised at mount time. Here is a v2 patch that should do the right
> thing.

Confirmed, that fixes it! I'll try your shutdown crash fix next.

-- 
NULL && (void)

Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-05 Thread Nix

On 5 Aug 2013, Trond Myklebust told this:
> Does the attached patch fix the problem?

> From 3c50ba80105464a28d456d9a1e0f1d81d4af92a8 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust 
> Date: Mon, 5 Aug 2013 12:06:12 -0400
> Subject: [PATCH] LOCKD: Don't call utsname()->nodename from
>  nlmclnt_setlockargs
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit

It makes it worse. Much, much worse. From a crash every so often when
I'm doing compilations over NFS, I get an immediate panic on startx,
long long before I even try to replicate the earlier panic:

[   83.432358] task: 88041aaa5ac0 ti: 8804199e2000 task.ti: 8804199e2000
[   83.432428] RIP: 0010:[] [] encode_nlm4_lock+0x26/0xbe
[   83.432512] RSP: 0018:8804199e3a78  EFLAGS: 00010286
[   83.432564] RAX:  RBX: 88041a577038 RCX: 
[   83.432630] RDX: 8804193b3098 RSI: 88041a577038 RDI: 008c
[   83.432697] RBP: 8804199e3aa8 R08: 8804193b3098 R09: 0001
[   83.432763] R10: 88042fa12980 R11: 88042fa12980 R12: 8804199e3ae8
[   83.432830] R13: 008c R14: 8804199e3fd8 R15: 815de80e
[   83.432898] FS:  7f594b40c740() GS:88042fa0() knlGS:
[   83.432974] CS:  0010 DS:  ES:  CR0: 80050033
[   83.433028] CR2: 008c CR3: 00041ab3d000 CR4: 001407f0
[   83.433095] DR0:  DR1:  DR2: 
[   83.433176] DR3:  DR6: 0ff0 DR7: 0400
[   83.433255] Stack:
[   83.433276]  88041a44fb70 88040004 8804199e3ae8 88041a577010
[   83.433360]  8804188e0e00 8804199e3fd8 8804199e3ac8 8124b0d7
[   83.433443]  8804188e0e00 8124b086 8804199e3b38 815e6032
[   83.433616] Call Trace:
[   83.433646]  [] nlm4_xdr_enc_lockargs+0x51/0x76
[   83.433707]  [] ? nlm4_xdr_enc_cancargs+0x56/0x56
[   83.433769]  [] rpcauth_wrap_req+0x57/0x62
[   83.433826]  [] call_transmit+0x17c/0x1f9
[   83.433880]  [] __rpc_execute+0xe8/0x2ca
[   83.433935]  [] rpc_execute+0x76/0x9d
[   83.433986]  [] rpc_run_task+0x78/0x80
[   83.434039]  [] rpc_call_sync+0x88/0x9e
[   83.434092]  [] nlmclnt_call+0xb5/0x240
[   83.434146]  [] nlmclnt_proc+0x226/0x5fb
[   83.434226]  [] nfs3_proc_lock+0x21/0x23
[   83.434280]  [] do_setlk+0x65/0xee
[   83.434329]  [] nfs_lock+0x14e/0x162
[   83.434382]  [] vfs_lock_file+0x29/0x35
[   83.434435]  [] fcntl_setlk+0x139/0x2c5
[   83.434490]  [] SyS_fcntl+0x2b6/0x47d
[   83.434543]  [] system_call_fastpath+0x16/0x1b
[   83.434600] Code: 5b 41 5c 5d c3 0f 1f 44 00 00 55 31 c0 48 83 c9 ff 48 89 e5 41 56 41 55 41 54 49 89 fc 53 48 89 f3 48 83 ec 10 4c 8b 2e 4c 89 ef  ae 4c 89 e7 48 f7 d1 4c 8d 71 ff 41 8d 76 04 e8 9f 16 3a 00
[   83.435077] RIP [] encode_nlm4_lock+0x26/0xbe
[   83.435140]  RSP 
[   83.435197] CR2: 008c

That's here:

(gdb) list *(encode_nlm4_lock+0x26)
0x8124af69 is in encode_nlm4_lock (fs/lockd/clnt4xdr.c:329).
324      *  string caller_name;
325      */
326     static void encode_caller_name(struct xdr_stream *xdr, const char *name)
327     {
328             /* NB: client-side does not set lock->len */
329             u32 length = strlen(name);
330             __be32 *p;
331
332             p = xdr_reserve_space(xdr, 4 + length);
333             xdr_encode_opaque(p, name, length);

   0x8124af69 <+38>:    repnz scas %es:(%rdi),%al

Pretty clearly, "name" can be NULL after this patch...
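
For anyone following along, the listing shows the usual XDR shape for a
string: reserve 4 + length bytes, then write a length word followed by
the bytes, which means strlen() runs on the pointer before anything else
can go wrong. Here is a minimal user-space sketch of that encoding with
an explicit guard on the pointer. It is an illustration only:
encode_string is a hypothetical name, the NULL check is my addition, and
none of this is the kernel's xdr_encode_opaque() or a proposed fix.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Encode `name` as an XDR variable-length opaque: a 4-byte big-endian
 * length, the bytes themselves, then zero padding up to a multiple
 * of 4. Returns the number of bytes written, or 0 if `name` is NULL
 * or the buffer is too small.
 */
size_t encode_string(uint8_t *buf, size_t buflen, const char *name)
{
	size_t len, padded;

	if (name == NULL)
		return 0;	/* the case strlen() cannot survive */

	len = strlen(name);
	padded = (len + 3) & ~(size_t)3;
	if (len > UINT32_MAX || buflen < 4 + padded)
		return 0;

	buf[0] = (uint8_t)(len >> 24);
	buf[1] = (uint8_t)(len >> 16);
	buf[2] = (uint8_t)(len >> 8);
	buf[3] = (uint8_t)len;
	memcpy(buf + 4, name, len);
	memset(buf + 4 + len, 0, padded - len);
	return 4 + padded;
}
```

A 7-character node name such as "spindle" therefore takes 12 bytes on
the wire: 4 for the length, 7 for the name and 1 of padding.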

-- 
NULL && (void)

Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-05 Thread Nix

On 5 Aug 2013, Jeff Layton said:

> On Mon, 5 Aug 2013 11:04:27 -0400
> Jeff Layton  wrote:
>
>> On Mon, 05 Aug 2013 15:48:01 +0100
>> Nix  wrote:
>> 
>> > On 5 Aug 2013, Jeff Layton stated:
>> > 
>> > > On Sun, 04 Aug 2013 16:40:58 +0100
>> > > Nix  wrote:
>> > >
>> > >> I just got this panic on 3.10.4, in the middle of a large parallel
>> > >> compilation (of Chromium, as it happens) over NFSv3:
>> > >> 
>> > >> [16364.527516] BUG: unable to handle kernel NULL pointer dereference at 
>> > >> 0008
>> > >> [16364.527571] IP: [] nlmclnt_setlockargs+0x55/0xcf
>> > >> [16364.527611] PGD 0 
>> > >> [16364.527626] Oops:  [#1] PREEMPT SMP 
>> > >> [16364.527656] Modules linked in: [last unloaded: microcode] 
>> > >> [16364.527690] CPU: 0 PID: 17034 Comm: flock Not tainted 
>> > >> 3.10.4-05315-gf4ce424-dirty #1
>> > >> [16364.527730] Hardware name: System manufacturer System Product 
>> > >> Name/P8H61-MX USB3, BIOS 0506 08/10/2012
>> > >> [16364.527775] task: 88041a97ad60 ti: 8803501d4000 task.ti: 
>> > >> 8803501d4000
>> > >> [16364.527813] RIP: 0010:[] [] 
>> > >> nlmclnt_setlockargs+0x55/0xcf
>> > >> [16364.527860] RSP: 0018:8803501d5c58  EFLAGS: 00010282
>> > >> [16364.527889] RAX: 88041a97ad60 RBX: 8803e49c8800 RCX: 
>> > >> 
>> > >> [16364.527926] RDX:  RSI: 004a RDI: 
>> > >> 8803e49c8b54
>> > >> [16364.527962] RBP: 8803501d5c68 R08: 00015720 R09: 
>> > >> 
>> > >> [16364.527998] R10: 7000 R11: 8803501d5d58 R12: 
>> > >> 8803501d5d58
>> > >> [16364.528034] R13: 88041bd2bc00 R14:  R15: 
>> > >> 8803fc9e2900
>> > >> [16364.528070] FS:  () GS:88042fa0() 
>> > >> knlGS:
>> > >> [16364.528111] CS:  0010 DS:  ES:  CR0: 80050033
>> > >> [16364.528142] CR2: 0008 CR3: 01c0b000 CR4: 
>> > >> 001407f0
>> > >> [16364.528177] DR0:  DR1:  DR2: 
>> > >> 
>> > >> [16364.528214] DR3:  DR6: 0ff0 DR7: 
>> > >> 0400
>> > >> [16364.528303] Stack:
>> > >> [16364.528316]  8803501d5d58 8803e49c8800 8803501d5cd8 
>> > >> 81245418 
>> > >> [16364.528369]   8803516f0bc0 8803d7b7b6c0 
>> > >> 81215c81 
>> > >> [16364.528418]  88030007 88041bd2bdc8 8801aabe9650 
>> > >> 8803fc9e2900 
>> > >> [16364.528467] Call Trace:
>> > >> [16364.528485]  [] nlmclnt_proc+0x148/0x5fb
>> > >> [16364.528516]  [] ? nfs_put_lock_context+0x69/0x6e
>> > >> [16364.528550]  [] nfs3_proc_lock+0x21/0x23
>> > >> [16364.528581]  [] do_unlk+0x96/0xb2
>> > >> [16364.528608]  [] nfs_flock+0x5a/0x71
>> > >> [16364.528637]  [] locks_remove_flock+0x9e/0x113
>> > >> [16364.528668]  [] __fput+0xb6/0x1e6
>> > >> [16364.528695]  [] fput+0xe/0x10
>> > >> [16364.528724]  [] task_work_run+0x7e/0x98
>> > >> [16364.528754]  [] do_exit+0x3cc/0x8fa
>> > >> [16364.528782]  [] ? SyS_wait4+0xa5/0xc2
>> > >> [16364.528811]  [] do_group_exit+0x6f/0xa2
>> > >> [16364.528843]  [] SyS_exit_group+0x17/0x17
>> > >> [16364.528876]  [] system_call_fastpath+0x16/0x1b
>> > >> [16364.528907] Code: 00 00 65 48 8b 04 25 c0 b8 00 00 48 8b 72 20 48 81 
>> > >> ee c0 01 00 00 f3 a4 48 8d bb 54 03 00 00 be 4a 00 00 00 48 8b 90 68 05 
>> > >> 00 00 <48> 8b 52 08 48 89 bb d0 00 00 00 48 83 c2 45 48 89 53 38 48 8b 
>> > >> [16364.529176] RIP [] nlmclnt_setlockargs+0x55/0xcf
>> > >> [16364.529264]  RSP 
>> > >> [16364.529283] CR2: 0008
>> > >> [16364.539039] ---[ end trace 5a73fddf23441377 ]---
[...]
> The listing and disassembly from nlmclnt_proc is not terribly
> interesting unfortunately. You really want to do the listing and
> disassembly of the RIP at panic time (nlmclnt_setlockargs+0x55).

Oh, sorry! Wrong e

Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-05 Thread Nix

On 5 Aug 2013, Jeff Layton stated:

> On Sun, 04 Aug 2013 16:40:58 +0100
> Nix  wrote:
>
>> I just got this panic on 3.10.4, in the middle of a large parallel
>> compilation (of Chromium, as it happens) over NFSv3:
>> 
>> [16364.527516] BUG: unable to handle kernel NULL pointer dereference at 
>> 0008
>> [16364.527571] IP: [] nlmclnt_setlockargs+0x55/0xcf
>> [16364.527611] PGD 0 
>> [16364.527626] Oops:  [#1] PREEMPT SMP 
>> [16364.527656] Modules linked in: [last unloaded: microcode] 
>> [16364.527690] CPU: 0 PID: 17034 Comm: flock Not tainted 
>> 3.10.4-05315-gf4ce424-dirty #1
>> [16364.527730] Hardware name: System manufacturer System Product 
>> Name/P8H61-MX USB3, BIOS 0506 08/10/2012
>> [16364.527775] task: 88041a97ad60 ti: 8803501d4000 task.ti: 
>> 8803501d4000
>> [16364.527813] RIP: 0010:[] [] 
>> nlmclnt_setlockargs+0x55/0xcf
>> [16364.527860] RSP: 0018:8803501d5c58  EFLAGS: 00010282
>> [16364.527889] RAX: 88041a97ad60 RBX: 8803e49c8800 RCX: 
>> 
>> [16364.527926] RDX:  RSI: 004a RDI: 
>> 8803e49c8b54
>> [16364.527962] RBP: 8803501d5c68 R08: 00015720 R09: 
>> 
>> [16364.527998] R10: 7000 R11: 8803501d5d58 R12: 
>> 8803501d5d58
>> [16364.528034] R13: 88041bd2bc00 R14:  R15: 
>> 8803fc9e2900
>> [16364.528070] FS:  () GS:88042fa0() 
>> knlGS:
>> [16364.528111] CS:  0010 DS:  ES:  CR0: 80050033
>> [16364.528142] CR2: 0008 CR3: 01c0b000 CR4: 
>> 001407f0
>> [16364.528177] DR0:  DR1:  DR2: 
>> 
>> [16364.528214] DR3:  DR6: 0ff0 DR7: 
>> 0400
>> [16364.528303] Stack:
>> [16364.528316]  8803501d5d58 8803e49c8800 8803501d5cd8 
>> 81245418 
>> [16364.528369]   8803516f0bc0 8803d7b7b6c0 
>> 81215c81 
>> [16364.528418]  88030007 88041bd2bdc8 8801aabe9650 
>> 8803fc9e2900 
>> [16364.528467] Call Trace:
>> [16364.528485]  [] nlmclnt_proc+0x148/0x5fb
>> [16364.528516]  [] ? nfs_put_lock_context+0x69/0x6e
>> [16364.528550]  [] nfs3_proc_lock+0x21/0x23
>> [16364.528581]  [] do_unlk+0x96/0xb2
>> [16364.528608]  [] nfs_flock+0x5a/0x71
>> [16364.528637]  [] locks_remove_flock+0x9e/0x113
>> [16364.528668]  [] __fput+0xb6/0x1e6
>> [16364.528695]  [] fput+0xe/0x10
>> [16364.528724]  [] task_work_run+0x7e/0x98
>> [16364.528754]  [] do_exit+0x3cc/0x8fa
>> [16364.528782]  [] ? SyS_wait4+0xa5/0xc2
>> [16364.528811]  [] do_group_exit+0x6f/0xa2
>> [16364.528843]  [] SyS_exit_group+0x17/0x17
>> [16364.528876]  [] system_call_fastpath+0x16/0x1b
>> [16364.528907] Code: 00 00 65 48 8b 04 25 c0 b8 00 00 48 8b 72 20 48 81 ee 
>> c0 01 00 00 f3 a4 48 8d bb 54 03 00 00 be 4a 00 00 00 48 8b 90 68 05 00 00 
>> <48> 8b 52 08 48 89 bb d0 00 00 00 48 83 c2 45 48 89 53 38 48 8b 
>> [16364.529176] RIP [] nlmclnt_setlockargs+0x55/0xcf
>> [16364.529264]  RSP 
>> [16364.529283] CR2: 0008
>> [16364.539039] ---[ end trace 5a73fddf23441377 ]---
>
> What might be most helpful is to figure out exactly where the above
> panic occurred.

OK. My kernel is non-modular and built without debugging information:
rebuilt with debugging info, and got this:

0x81245418 is in nlmclnt_proc (fs/lockd/clntproc.c:172).
167 return -ENOMEM;
168 }
169 /* Set up the argument struct */
170 nlmclnt_setlockargs(call, fl);
171
172 if (IS_SETLK(cmd) || IS_SETLKW(cmd)) {
173 if (fl->fl_type != F_UNLCK) {
174 call->a_args.block = IS_SETLKW(cmd) ? 1 : 0;
175 status = nlmclnt_lock(call, fl);
176 } else

That's decimal 328:

   0x81245413 <+323>:   callq  0x81245102 <nlmclnt_setlockargs>
   0x81245418 <+328>:   mov    -0x40(%rbp),%eax
   0x8124541b <+331>:   sub    $0x6,%eax
   0x8124541e <+334>:   cmp    $0x1,%eax

nlm_alloc_call() cannot fail (we have a NULL check right there), and fl
also cannot be NULL because it's dereferenced in nfs_flock(), up the
call chain from where we are.

Time to stick some printk()s in, I suspect. (Not sure how to keep them
from utterly flooding the log, though.)
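
A note on the flooding worry: the kernel already has rate-limited
variants of printk() (printk_ratelimited() and the pr_*_ratelimited()
helpers in <linux/printk.h>), which are the usual tool for this. The
sketch below is not kernel code; it is a minimal userspace model of the
interval/burst idea, with struct ratelimit and ratelimit_ok() invented
here purely for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model of interval/burst rate limiting. */
struct ratelimit {
    long interval;      /* window length, in caller-supplied ticks */
    int  burst;         /* messages allowed per window */
    long window_start;  /* tick at which the current window began */
    int  printed;       /* messages emitted in the current window */
    int  missed;        /* messages suppressed in the current window */
};

/* Returns true if a message may be emitted at time `now`. */
static bool ratelimit_ok(struct ratelimit *rl, long now)
{
    if (now - rl->window_start >= rl->interval) {
        /* New window: report what was dropped, then reset the counters. */
        if (rl->missed)
            fprintf(stderr, "%d messages suppressed\n", rl->missed);
        rl->window_start = now;
        rl->printed = 0;
        rl->missed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    rl->missed++;
    return false;
}
```

With interval = 5 and burst = 2, the first two calls in a window return
true and later ones are only counted as missed until the window rolls
over.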

>> This is the same machine on which this panic has been occurring on
>> shutdown since 3.9.x: Al Viro has previously point

Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-05 Thread Nix

On 5 Aug 2013, Jeff Layton stated:

 On Sun, 04 Aug 2013 16:40:58 +0100
 Nix n...@esperi.org.uk wrote:

 I just got this panic on 3.10.4, in the middle of a large parallel
 compilation (of Chromium, as it happens) over NFSv3:
 
 [16364.527516] BUG: unable to handle kernel NULL pointer dereference at 
 0008
 [16364.527571] IP: [81245157] nlmclnt_setlockargs+0x55/0xcf
 [16364.527611] PGD 0 
 [16364.527626] Oops:  [#1] PREEMPT SMP 
 [16364.527656] Modules linked in: [last unloaded: microcode] 
 [16364.527690] CPU: 0 PID: 17034 Comm: flock Not tainted 
 3.10.4-05315-gf4ce424-dirty #1
 [16364.527730] Hardware name: System manufacturer System Product 
 Name/P8H61-MX USB3, BIOS 0506 08/10/2012
 [16364.527775] task: 88041a97ad60 ti: 8803501d4000 task.ti: 
 8803501d4000
 [16364.527813] RIP: 0010:[81245157] [81245157] 
 nlmclnt_setlockargs+0x55/0xcf
 [16364.527860] RSP: 0018:8803501d5c58  EFLAGS: 00010282
 [16364.527889] RAX: 88041a97ad60 RBX: 8803e49c8800 RCX: 
 
 [16364.527926] RDX:  RSI: 004a RDI: 
 8803e49c8b54
 [16364.527962] RBP: 8803501d5c68 R08: 00015720 R09: 
 
 [16364.527998] R10: 7000 R11: 8803501d5d58 R12: 
 8803501d5d58
 [16364.528034] R13: 88041bd2bc00 R14:  R15: 
 8803fc9e2900
 [16364.528070] FS:  () GS:88042fa0() 
 knlGS:
 [16364.528111] CS:  0010 DS:  ES:  CR0: 80050033
 [16364.528142] CR2: 0008 CR3: 01c0b000 CR4: 
 001407f0
 [16364.528177] DR0:  DR1:  DR2: 
 
 [16364.528214] DR3:  DR6: 0ff0 DR7: 
 0400
 [16364.528303] Stack:
 [16364.528316]  8803501d5d58 8803e49c8800 8803501d5cd8 
 81245418 
 [16364.528369]   8803516f0bc0 8803d7b7b6c0 
 81215c81 
 [16364.528418]  88030007 88041bd2bdc8 8801aabe9650 
 8803fc9e2900 
 [16364.528467] Call Trace:
 [16364.528485]  [81245418] nlmclnt_proc+0x148/0x5fb
 [16364.528516]  [81215c81] ? nfs_put_lock_context+0x69/0x6e
 [16364.528550]  [812209a2] nfs3_proc_lock+0x21/0x23
 [16364.528581]  [812149dd] do_unlk+0x96/0xb2
 [16364.528608]  [81214b41] nfs_flock+0x5a/0x71
 [16364.528637]  [8119a747] locks_remove_flock+0x9e/0x113
 [16364.528668]  [8115cc68] __fput+0xb6/0x1e6
 [16364.528695]  [8115cda6] fput+0xe/0x10
 [16364.528724]  [810998da] task_work_run+0x7e/0x98
 [16364.528754]  [81082bc5] do_exit+0x3cc/0x8fa
 [16364.528782]  [81083501] ? SyS_wait4+0xa5/0xc2
 [16364.528811]  [8108328d] do_group_exit+0x6f/0xa2
 [16364.528843]  [810832d7] SyS_exit_group+0x17/0x17
 [16364.528876]  [81613e92] system_call_fastpath+0x16/0x1b
 [16364.528907] Code: 00 00 65 48 8b 04 25 c0 b8 00 00 48 8b 72 20 48 81 ee 
 c0 01 00 00 f3 a4 48 8d bb 54 03 00 00 be 4a 00 00 00 48 8b 90 68 05 00 00 
 48 8b 52 08 48 89 bb d0 00 00 00 48 83 c2 45 48 89 53 38 48 8b 
 [16364.529176] RIP [81245157] nlmclnt_setlockargs+0x55/0xcf
 [16364.529264]  RSP 8803501d5c58
 [16364.529283] CR2: 0008
 [16364.539039] ---[ end trace 5a73fddf23441377 ]---

> What might be most helpful is to figure out exactly where the above
> panic occurred.

OK. My kernel is non-modular and built without debugging information:
rebuilt with debugging info, and got this:

0x81245418 is in nlmclnt_proc (fs/lockd/clntproc.c:172).
167 return -ENOMEM;
168 }
169 /* Set up the argument struct */
170 nlmclnt_setlockargs(call, fl);
171
172 if (IS_SETLK(cmd) || IS_SETLKW(cmd)) {
173 if (fl->fl_type != F_UNLCK) {
174 call->a_args.block = IS_SETLKW(cmd) ? 1 : 0;
175 status = nlmclnt_lock(call, fl);
176 } else

That's decimal 328:

   0x81245413 <+323>:   callq  0x81245102 <nlmclnt_setlockargs>
   0x81245418 <+328>:   mov    -0x40(%rbp),%eax
   0x8124541b <+331>:   sub    $0x6,%eax
   0x8124541e <+334>:   cmp    $0x1,%eax

nlm_alloc_call() cannot fail (we have a NULL check right there), and fl
also cannot be NULL because it's dereferenced in nfs_flock(), up the
call chain from where we are.

Time to stick some printk()s in, I suspect. (Not sure how to keep them
from utterly flooding the log, though.)

 This is the same machine on which this panic has been occurring on
 shutdown since 3.9.x: Al Viro has previously pointed out the problem and
 nothing has happened:
 
 [50618.993226] BUG: unable to handle kernel NULL pointer dereference at 
 0008
 [50618.993904] IP: [81165e76] path_init+0x11c/0x36f
 [50618.994609] PGD 0 
 [50618.995329] Oops:  [#1] PREEMPT SMP

Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-05 Thread Nix

On 5 Aug 2013, Jeff Layton said:

 On Mon, 5 Aug 2013 11:04:27 -0400
 Jeff Layton jlay...@redhat.com wrote:

 On Mon, 05 Aug 2013 15:48:01 +0100
 Nix n...@esperi.org.uk wrote:
 
  On 5 Aug 2013, Jeff Layton stated:
  
   On Sun, 04 Aug 2013 16:40:58 +0100
   Nix n...@esperi.org.uk wrote:
  
   I just got this panic on 3.10.4, in the middle of a large parallel
   compilation (of Chromium, as it happens) over NFSv3:
   
   [16364.527516] BUG: unable to handle kernel NULL pointer dereference at 
   0008
   [16364.527571] IP: [81245157] nlmclnt_setlockargs+0x55/0xcf
   [16364.527611] PGD 0 
   [16364.527626] Oops:  [#1] PREEMPT SMP 
   [16364.527656] Modules linked in: [last unloaded: microcode] 
   [16364.527690] CPU: 0 PID: 17034 Comm: flock Not tainted 
   3.10.4-05315-gf4ce424-dirty #1
   [16364.527730] Hardware name: System manufacturer System Product 
   Name/P8H61-MX USB3, BIOS 0506 08/10/2012
   [16364.527775] task: 88041a97ad60 ti: 8803501d4000 task.ti: 
   8803501d4000
   [16364.527813] RIP: 0010:[81245157] [81245157] 
   nlmclnt_setlockargs+0x55/0xcf
   [16364.527860] RSP: 0018:8803501d5c58  EFLAGS: 00010282
   [16364.527889] RAX: 88041a97ad60 RBX: 8803e49c8800 RCX: 
   
   [16364.527926] RDX:  RSI: 004a RDI: 
   8803e49c8b54
   [16364.527962] RBP: 8803501d5c68 R08: 00015720 R09: 
   
   [16364.527998] R10: 7000 R11: 8803501d5d58 R12: 
   8803501d5d58
   [16364.528034] R13: 88041bd2bc00 R14:  R15: 
   8803fc9e2900
   [16364.528070] FS:  () GS:88042fa0() 
   knlGS:
   [16364.528111] CS:  0010 DS:  ES:  CR0: 80050033
   [16364.528142] CR2: 0008 CR3: 01c0b000 CR4: 
   001407f0
   [16364.528177] DR0:  DR1:  DR2: 
   
   [16364.528214] DR3:  DR6: 0ff0 DR7: 
   0400
   [16364.528303] Stack:
   [16364.528316]  8803501d5d58 8803e49c8800 8803501d5cd8 
   81245418 
   [16364.528369]   8803516f0bc0 8803d7b7b6c0 
   81215c81 
   [16364.528418]  88030007 88041bd2bdc8 8801aabe9650 
   8803fc9e2900 
   [16364.528467] Call Trace:
   [16364.528485]  [81245418] nlmclnt_proc+0x148/0x5fb
   [16364.528516]  [81215c81] ? nfs_put_lock_context+0x69/0x6e
   [16364.528550]  [812209a2] nfs3_proc_lock+0x21/0x23
   [16364.528581]  [812149dd] do_unlk+0x96/0xb2
   [16364.528608]  [81214b41] nfs_flock+0x5a/0x71
   [16364.528637]  [8119a747] locks_remove_flock+0x9e/0x113
   [16364.528668]  [8115cc68] __fput+0xb6/0x1e6
   [16364.528695]  [8115cda6] fput+0xe/0x10
   [16364.528724]  [810998da] task_work_run+0x7e/0x98
   [16364.528754]  [81082bc5] do_exit+0x3cc/0x8fa
   [16364.528782]  [81083501] ? SyS_wait4+0xa5/0xc2
   [16364.528811]  [8108328d] do_group_exit+0x6f/0xa2
   [16364.528843]  [810832d7] SyS_exit_group+0x17/0x17
   [16364.528876]  [81613e92] system_call_fastpath+0x16/0x1b
   [16364.528907] Code: 00 00 65 48 8b 04 25 c0 b8 00 00 48 8b 72 20 48 81 
   ee c0 01 00 00 f3 a4 48 8d bb 54 03 00 00 be 4a 00 00 00 48 8b 90 68 05 
   00 00 48 8b 52 08 48 89 bb d0 00 00 00 48 83 c2 45 48 89 53 38 48 8b 
   [16364.529176] RIP [81245157] nlmclnt_setlockargs+0x55/0xcf
   [16364.529264]  RSP 8803501d5c58
   [16364.529283] CR2: 0008
   [16364.539039] ---[ end trace 5a73fddf23441377 ]---
[...]
> The listing and disassembly from nlmclnt_proc is not terribly
> interesting unfortunately. You really want to do the listing and
> disassembly of the RIP at panic time (nlmclnt_setlockargs+0x55).

Oh, sorry! Wrong end of the oops :)

0x81245157 is in nlmclnt_setlockargs (fs/lockd/clntproc.c:131).
126 struct nlm_args *argp = &req->a_args;
127 struct nlm_lock *lock = &argp->lock;
128
129 nlmclnt_next_cookie(&argp->cookie);
130 memcpy(&lock->fh, NFS_FH(file_inode(fl->fl_file)), sizeof(struct nfs_fh));
131 lock->caller  = utsname()->nodename;
132 lock->oh.data = req->a_owner;
133 lock->oh.len  = snprintf(req->a_owner, sizeof(req->a_owner), "%u@%s",
134 (unsigned int)fl->fl_u.nfs_fl.owner->pid,
135 utsname()->nodename);

   0x81245102 <+0>:     callq  0x81613b00 <__fentry__>
   0x81245107 <+5>:     push   %rbp
   0x81245108 <+6>:     mov    %rsp,%rbp
   0x8124510b <+9>:     push   %r12
   0x8124510d <+11>:    mov    %rsi,%r12
   0x81245110 <+14>:    push   %rbx
   0x81245111 <+15>:    mov    %rdi,%rbx
   0x81245114 <+18>:    lea    0x10(%rdi),%rdi
   0x81245118 <+22>:    callq
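
An aside on why the fault address is 0008 rather than 0: the faulting
instruction bytes in the Code: line of the oops, 48 8b 52 08, decode as
mov 0x8(%rdx),%rdx, and RDX appears blank (presumably all zeros) in the
register dump, so the CPU was loading a pointer-sized member at offset 8
of a NULL structure pointer. Given source line 131, that is consistent
with utsname() walking through a NULL current->nsproxy in an exiting
task, though nothing quoted here confirms it. The sketch below uses mock
structures to show how such an address arises; mock_nsproxy,
mock_uts_namespace and uts_ns_load_address() are made up, and the layout
is an assumption, not the kernel's real definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock types, loosely modeled on the kernel's; NOT the real definitions. */
struct mock_uts_namespace {
    char nodename[65];
};

struct mock_nsproxy {
    int count;                          /* stand-in for a reference count */
    struct mock_uts_namespace *uts_ns;  /* lands at offset 8 on LP64 targets */
};

/* Address the CPU would try to load from when reading base->uts_ns. */
static uintptr_t uts_ns_load_address(uintptr_t base)
{
    return base + offsetof(struct mock_nsproxy, uts_ns);
}
```

Calling uts_ns_load_address(0) on a typical 64-bit build yields 8, which
is the shape of address reported in CR2: a NULL base plus a small member
offset.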

Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-05 Thread Nix

On 5 Aug 2013, Trond Myklebust told this:
> Does the attached patch fix the problem?
>
> From 3c50ba80105464a28d456d9a1e0f1d81d4af92a8 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <trond.mykleb...@netapp.com>
> Date: Mon, 5 Aug 2013 12:06:12 -0400
> Subject: [PATCH] LOCKD: Don't call utsname()->nodename from
>  nlmclnt_setlockargs
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit

It makes it worse. Much, much worse. From a crash every so often when
I'm doing compilations over NFS, I get an immediate panic on startx,
long long before I even try to replicate the earlier panic:

[   83.432358] task: 88041aaa5ac0 ti: 8804199e2000 task.ti: 
8804199e2000
[   83.432428] RIP: 0010:[8124af69] [8124af69] 
encode_nlm4_lock+0x26/0xbe
[   83.432512] RSP: 0018:8804199e3a78  EFLAGS: 00010286
[   83.432564] RAX:  RBX: 88041a577038 RCX: 
[   83.432630] RDX: 8804193b3098 RSI: 88041a577038 RDI: 008c
[   83.432697] RBP: 8804199e3aa8 R08: 8804193b3098 R09: 0001
[   83.432763] R10: 88042fa12980 R11: 88042fa12980 R12: 8804199e3ae8
[   83.432830] R13: 008c R14: 8804199e3fd8 R15: 815de80e
[   83.432898] FS:  7f594b40c740() GS:88042fa0() 
knlGS:
[   83.432974] CS:  0010 DS:  ES:  CR0: 80050033
[   83.433028] CR2: 008c CR3: 00041ab3d000 CR4: 001407f0
[   83.433095] DR0:  DR1:  DR2: 
[   83.433176] DR3:  DR6: 0ff0 DR7: 0400
[   83.433255] Stack:
[   83.433276]  88041a44fb70 88040004 8804199e3ae8 
88041a577010 
[   83.433360]  8804188e0e00 8804199e3fd8 8804199e3ac8 
8124b0d7 
[   83.433443]  8804188e0e00 8124b086 8804199e3b38 
815e6032 
[   83.433616] Call Trace:
[   83.433646]  [8124b0d7] nlm4_xdr_enc_lockargs+0x51/0x76
[   83.433707]  [8124b086] ? nlm4_xdr_enc_cancargs+0x56/0x56
[   83.433769]  [815e6032] rpcauth_wrap_req+0x57/0x62
[   83.433826]  [815de98a] call_transmit+0x17c/0x1f9
[   83.433880]  [815e4e58] __rpc_execute+0xe8/0x2ca
[   83.433935]  [815e50f9] rpc_execute+0x76/0x9d
[   83.433986]  [815debc1] rpc_run_task+0x78/0x80
[   83.434039]  [815decff] rpc_call_sync+0x88/0x9e
[   83.434092]  [81244b3c] nlmclnt_call+0xb5/0x240
[   83.434146]  [812454f0] nlmclnt_proc+0x226/0x5fb
[   83.434226]  [812209a2] nfs3_proc_lock+0x21/0x23
[   83.434280]  [81214a5e] do_setlk+0x65/0xee
[   83.434329]  [81214ca6] nfs_lock+0x14e/0x162
[   83.434382]  [81199661] vfs_lock_file+0x29/0x35
[   83.434435]  [8119a51d] fcntl_setlk+0x139/0x2c5
[   83.434490]  [81169621] SyS_fcntl+0x2b6/0x47d
[   83.434543]  [81613e92] system_call_fastpath+0x16/0x1b
[   83.434600] Code: 5b 41 5c 5d c3 0f 1f 44 00 00 55 31 c0 48 83 c9 ff 48 89 
e5 41 56 41 55 41 54 49 89 fc 53 48 89 f3 48 83 ec 10 4c 8b 2e 4c 89 ef f2 ae 
4c 89 e7 48 f7 d1 4c 8d 71 ff 41 8d 76 04 e8 9f 16 3a 00 
[   83.435077] RIP [8124af69] encode_nlm4_lock+0x26/0xbe
[   83.435140]  RSP 8804199e3a78
[   83.435197] CR2: 008c

That's here:

(gdb) list *(encode_nlm4_lock+0x26)
0x8124af69 is in encode_nlm4_lock (fs/lockd/clnt4xdr.c:329).
324  *  string caller_name<LM_MAXSTRLEN>;
325  */
326 static void encode_caller_name(struct xdr_stream *xdr, const char *name)
327 {
328 /* NB: client-side does not set lock->len */
329 u32 length = strlen(name);
330 __be32 *p;
331
332 p = xdr_reserve_space(xdr, 4 + length);
333 xdr_encode_opaque(p, name, length);

   0x8124af69 <+38>:    repnz scas %es:(%rdi),%al

Pretty clearly, name can be NULL after this patch...

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-04 Thread Nix

I just got this panic on 3.10.4, in the middle of a large parallel
compilation (of Chromium, as it happens) over NFSv3:

[16364.527516] BUG: unable to handle kernel NULL pointer dereference at 
0008
[16364.527571] IP: [] nlmclnt_setlockargs+0x55/0xcf
[16364.527611] PGD 0 
[16364.527626] Oops:  [#1] PREEMPT SMP 
[16364.527656] Modules linked in: [last unloaded: microcode] 
[16364.527690] CPU: 0 PID: 17034 Comm: flock Not tainted 
3.10.4-05315-gf4ce424-dirty #1
[16364.527730] Hardware name: System manufacturer System Product Name/P8H61-MX 
USB3, BIOS 0506 08/10/2012
[16364.527775] task: 88041a97ad60 ti: 8803501d4000 task.ti: 
8803501d4000
[16364.527813] RIP: 0010:[] [] 
nlmclnt_setlockargs+0x55/0xcf
[16364.527860] RSP: 0018:8803501d5c58  EFLAGS: 00010282
[16364.527889] RAX: 88041a97ad60 RBX: 8803e49c8800 RCX: 
[16364.527926] RDX:  RSI: 004a RDI: 8803e49c8b54
[16364.527962] RBP: 8803501d5c68 R08: 00015720 R09: 
[16364.527998] R10: 7000 R11: 8803501d5d58 R12: 8803501d5d58
[16364.528034] R13: 88041bd2bc00 R14:  R15: 8803fc9e2900
[16364.528070] FS:  () GS:88042fa0() 
knlGS:
[16364.528111] CS:  0010 DS:  ES:  CR0: 80050033
[16364.528142] CR2: 0008 CR3: 01c0b000 CR4: 001407f0
[16364.528177] DR0:  DR1:  DR2: 
[16364.528214] DR3:  DR6: 0ff0 DR7: 0400
[16364.528303] Stack:
[16364.528316]  8803501d5d58 8803e49c8800 8803501d5cd8 
81245418 
[16364.528369]   8803516f0bc0 8803d7b7b6c0 
81215c81 
[16364.528418]  88030007 88041bd2bdc8 8801aabe9650 
8803fc9e2900 
[16364.528467] Call Trace:
[16364.528485]  [] nlmclnt_proc+0x148/0x5fb
[16364.528516]  [] ? nfs_put_lock_context+0x69/0x6e
[16364.528550]  [] nfs3_proc_lock+0x21/0x23
[16364.528581]  [] do_unlk+0x96/0xb2
[16364.528608]  [] nfs_flock+0x5a/0x71
[16364.528637]  [] locks_remove_flock+0x9e/0x113
[16364.528668]  [] __fput+0xb6/0x1e6
[16364.528695]  [] fput+0xe/0x10
[16364.528724]  [] task_work_run+0x7e/0x98
[16364.528754]  [] do_exit+0x3cc/0x8fa
[16364.528782]  [] ? SyS_wait4+0xa5/0xc2
[16364.528811]  [] do_group_exit+0x6f/0xa2
[16364.528843]  [] SyS_exit_group+0x17/0x17
[16364.528876]  [] system_call_fastpath+0x16/0x1b
[16364.528907] Code: 00 00 65 48 8b 04 25 c0 b8 00 00 48 8b 72 20 48 81 ee c0 
01 00 00 f3 a4 48 8d bb 54 03 00 00 be 4a 00 00 00 48 8b 90 68 05 00 00 <48> 8b 
52 08 48 89 bb d0 00 00 00 48 83 c2 45 48 89 53 38 48 8b 
[16364.529176] RIP [] nlmclnt_setlockargs+0x55/0xcf
[16364.529264]  RSP 
[16364.529283] CR2: 0008
[16364.539039] ---[ end trace 5a73fddf23441377 ]---

This is the same machine on which this panic has been occurring on
shutdown since 3.9.x: Al Viro has previously pointed out the problem and
nothing has happened:

[50618.993226] BUG: unable to handle kernel NULL pointer dereference at 
0008
[50618.993904] IP: [] path_init+0x11c/0x36f
[50618.994609] PGD 0 
[50618.995329] Oops:  [#1] PREEMPT SMP 
[50618.996027] Modules linked in: [last unloaded: microcode] 
[50618.996758] CPU: 3 PID: 1262 Comm: pulseaudio Not tainted 
3.10.4-05315-gf4ce424-dirty #1
[50618.997506] Hardware name: System manufacturer System Product Name/P8H61-MX 
USB3, BIOS 0506 08/10/2012
[50618.998268] task: 88041bf1ad60 ti: 88041b19e000 task.ti: 
88041b19e000
[50618.999017] RIP: 0010:[] [] 
path_init+0x11c/0x36f
[50618.999804] RSP: 0018:88041b19f508  EFLAGS: 00010246
[50619.000592] RAX:  RBX: 88041b19f658 RCX: 005c
[50619.001398] RDX: 5c5c RSI: 880419b3781a RDI: 81c34a10
[50619.002198] RBP: 88041b19f558 R08: 88041b19f588 R09: 88041b19f7c4
[50619.002999] R10: ff9c R11: 88041b19f658 R12: 0041
[50619.003816] R13: 0040 R14: 880419b3781a R15: 88041b19f7c4
[50619.004638] FS:  7fca19bc2740() GS:88042fac() 
knlGS:
[50619.005465] CS:  0010 DS:  ES:  CR0: 80050033
[50619.006284] CR2: 0008 CR3: 01c0b000 CR4: 001407e0
[50619.007092] DR0:  DR1:  DR2: 
[50619.007922] DR3:  DR6: 0ff0 DR7: 0400
[50619.008750] Stack: [50619.009576]  88041b19f518 ffbfaa5e 
 8151e735 
[50619.010437]  c900080ae000 88041b19f658 0041 
880419b3781a 
[50619.011292]  88041b19f628 88041b19f7c4 88041b19f5e8 
811660fc 
[50619.012119] Call Trace:
[50619.012947]  [] ? skb_checksum+0x4f/0x25b
[50619.013782]  [] path_lookupat+0x33/0x6c5
[50619.014618]  [] ? dev_hard_start_xmit+0x2e5/0x50b
[50619.015457]  []

[3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*

2013-08-04 Thread Nix

I just got this panic on 3.10.4, in the middle of a large parallel
compilation (of Chromium, as it happens) over NFSv3:

[16364.527516] BUG: unable to handle kernel NULL pointer dereference at 
0008
[16364.527571] IP: [81245157] nlmclnt_setlockargs+0x55/0xcf
[16364.527611] PGD 0 
[16364.527626] Oops:  [#1] PREEMPT SMP 
[16364.527656] Modules linked in: [last unloaded: microcode] 
[16364.527690] CPU: 0 PID: 17034 Comm: flock Not tainted 
3.10.4-05315-gf4ce424-dirty #1
[16364.527730] Hardware name: System manufacturer System Product Name/P8H61-MX 
USB3, BIOS 0506 08/10/2012
[16364.527775] task: 88041a97ad60 ti: 8803501d4000 task.ti: 
8803501d4000
[16364.527813] RIP: 0010:[81245157] [81245157] 
nlmclnt_setlockargs+0x55/0xcf
[16364.527860] RSP: 0018:8803501d5c58  EFLAGS: 00010282
[16364.527889] RAX: 88041a97ad60 RBX: 8803e49c8800 RCX: 
[16364.527926] RDX:  RSI: 004a RDI: 8803e49c8b54
[16364.527962] RBP: 8803501d5c68 R08: 00015720 R09: 
[16364.527998] R10: 7000 R11: 8803501d5d58 R12: 8803501d5d58
[16364.528034] R13: 88041bd2bc00 R14:  R15: 8803fc9e2900
[16364.528070] FS:  () GS:88042fa0() 
knlGS:
[16364.528111] CS:  0010 DS:  ES:  CR0: 80050033
[16364.528142] CR2: 0008 CR3: 01c0b000 CR4: 001407f0
[16364.528177] DR0:  DR1:  DR2: 
[16364.528214] DR3:  DR6: 0ff0 DR7: 0400
[16364.528303] Stack:
[16364.528316]  8803501d5d58 8803e49c8800 8803501d5cd8 
81245418 
[16364.528369]   8803516f0bc0 8803d7b7b6c0 
81215c81 
[16364.528418]  88030007 88041bd2bdc8 8801aabe9650 
8803fc9e2900 
[16364.528467] Call Trace:
[16364.528485]  [81245418] nlmclnt_proc+0x148/0x5fb
[16364.528516]  [81215c81] ? nfs_put_lock_context+0x69/0x6e
[16364.528550]  [812209a2] nfs3_proc_lock+0x21/0x23
[16364.528581]  [812149dd] do_unlk+0x96/0xb2
[16364.528608]  [81214b41] nfs_flock+0x5a/0x71
[16364.528637]  [8119a747] locks_remove_flock+0x9e/0x113
[16364.528668]  [8115cc68] __fput+0xb6/0x1e6
[16364.528695]  [8115cda6] fput+0xe/0x10
[16364.528724]  [810998da] task_work_run+0x7e/0x98
[16364.528754]  [81082bc5] do_exit+0x3cc/0x8fa
[16364.528782]  [81083501] ? SyS_wait4+0xa5/0xc2
[16364.528811]  [8108328d] do_group_exit+0x6f/0xa2
[16364.528843]  [810832d7] SyS_exit_group+0x17/0x17
[16364.528876]  [81613e92] system_call_fastpath+0x16/0x1b
[16364.528907] Code: 00 00 65 48 8b 04 25 c0 b8 00 00 48 8b 72 20 48 81 ee c0 
01 00 00 f3 a4 48 8d bb 54 03 00 00 be 4a 00 00 00 48 8b 90 68 05 00 00 48 8b 
52 08 48 89 bb d0 00 00 00 48 83 c2 45 48 89 53 38 48 8b 
[16364.529176] RIP [81245157] nlmclnt_setlockargs+0x55/0xcf
[16364.529264]  RSP 8803501d5c58
[16364.529283] CR2: 0008
[16364.539039] ---[ end trace 5a73fddf23441377 ]---

This is the same machine on which this panic has been occurring on
shutdown since 3.9.x: Al Viro has previously pointed out the problem and
nothing has happened:

[50618.993226] BUG: unable to handle kernel NULL pointer dereference at 
0008
[50618.993904] IP: [81165e76] path_init+0x11c/0x36f
[50618.994609] PGD 0 
[50618.995329] Oops:  [#1] PREEMPT SMP 
[50618.996027] Modules linked in: [last unloaded: microcode] 
[50618.996758] CPU: 3 PID: 1262 Comm: pulseaudio Not tainted 
3.10.4-05315-gf4ce424-dirty #1
[50618.997506] Hardware name: System manufacturer System Product Name/P8H61-MX 
USB3, BIOS 0506 08/10/2012
[50618.998268] task: 88041bf1ad60 ti: 88041b19e000 task.ti: 
88041b19e000
[50618.999017] RIP: 0010:[81165e76] [81165e76] 
path_init+0x11c/0x36f
[50618.999804] RSP: 0018:88041b19f508  EFLAGS: 00010246
[50619.000592] RAX:  RBX: 88041b19f658 RCX: 005c
[50619.001398] RDX: 5c5c RSI: 880419b3781a RDI: 81c34a10
[50619.002198] RBP: 88041b19f558 R08: 88041b19f588 R09: 88041b19f7c4
[50619.002999] R10: ff9c R11: 88041b19f658 R12: 0041
[50619.003816] R13: 0040 R14: 880419b3781a R15: 88041b19f7c4
[50619.004638] FS:  7fca19bc2740() GS:88042fac() 
knlGS:
[50619.005465] CS:  0010 DS:  ES:  CR0: 80050033
[50619.006284] CR2: 0008 CR3: 01c0b000 CR4: 001407e0
[50619.007092] DR0:  DR1:  DR2: 
[50619.007922] DR3:  DR6: 0ff0 DR7: 0400
[50619.008750] Stack: [50619.009576]  88041b19f518 ffbfaa5e 
 8151e735 
[50619.010437]

Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-08-01 Thread Nix

On 1 Aug 2013, Bernd Schubert verbalised:

> On 07/30/2013 11:20 PM, Nix wrote:
>> On 30 Jul 2013, Bernd Schubert told this:
>>
>>> On 07/30/2013 02:56 AM, Nix wrote:
>>>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>>>
>>>>> Please supply the information that Martin Petersen asked
>>>>> for.
>>>>
>>>> Did it in private IRC (the advantage of working for the same division of
>>>> the same company!)
>>>>
>>>> I didn't realise the original fix was actually implemented to allow
>>>> Bernd, with a different Areca controller, to boot... obviously, in that
>>>> situation, reversion is wrong, since that would just replace one won't-
>>>> boot situation with another.
>>>
>>> Unless there is very simple fix the commit should reverted, imho. It
>>> would better then to remove write-same support from the md-layer.
>>
>> I'm not using md on that machine, just LVM. Our suspicion is that ext4
>> is doing a WRITE SAME for some reason.
>
> I didn't check yet for other cases, mkfs.ext4 does WRITE SAME and with
> lazy init it also will happen after mounting the file system, while
> lazy init is running (inode zeroing).

Well, it'll happen the first few times you mount the fs. If your fs is
years old (as mine are) the inode tables will probably have been
initialized by now!

-- 
NULL && (void)

Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Nix

On 30 Jul 2013, Bernd Schubert told this:

> On 07/30/2013 02:56 AM, Nix wrote:
>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>
>>> Please supply the information that Martin Petersen asked
>>> for.
>>
>> Did it in private IRC (the advantage of working for the same division of
>> the same company!)
>>
>> I didn't realise the original fix was actually implemented to allow
>> Bernd, with a different Areca controller, to boot... obviously, in that
>> situation, reversion is wrong, since that would just replace one won't-
>> boot situation with another.
>
> Unless there is very simple fix the commit should reverted, imho. It
> would better then to remove write-same support from the md-layer.

I'm not using md on that machine, just LVM. Our suspicion is that ext4
is doing a WRITE SAME for some reason.

-- 
NULL && (void)

Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Nix

On 30 Jul 2013, Douglas Gilbert outgrape:

> Please supply the information that Martin Petersen asked
> for.

Did it in private IRC (the advantage of working for the same division of
the same company!)

I didn't realise the original fix was actually implemented to allow
Bernd, with a different Areca controller, to boot... obviously, in that
situation, reversion is wrong, since that would just replace one won't-
boot situation with another.

It looks like a solution is possible that will let us boot *both* my
controller (with its old 2009-era firmware) *and* his. We just have
to let Martin implement it. Give him time, I only got a successful
boot out of it an hour ago :)

> I just examined a more recent Areca SAS RAID controller
> and would describe it as the SCSI device from hell. One solution
> to this problem is to modify the arcmsr driver so it returns
> a more consistent set of lies to the management SCSI commands that
> Martin is asking about.

I can't help notice that something is skewy in its error handling, too.
When the controller errors, even resetting the bus doesn't seem to be
enough to bring it back :/ I've seen errors from it before which did
*not* lead to it imploding forever, but this is apparently not one such.

Certainly Areca-the-company has... issues with communication with the
community (i.e., they don't). A shame I didn't know that before I bought
the controller and made all my data completely dependent on it, really.
Shame, the controller otherwise works very well (fast, and has coped
with a disk failure with aplomb).

-- 
NULL && (void)
