Re: Fatal double fault in ZFS with yesterday's CURRENT [SOLVED]

2014-05-04 Thread Fabian Keil
"Steven Hartland"  wrote:

> Thanks for your help testing this, Fabian. I've now committed the fix for
> this:
> http://svnweb.freebsd.org/changeset/base/265321

Thanks a lot, Steve.

Fabian


signature.asc
Description: PGP signature


Re: Fatal double fault in ZFS with yesterday's CURRENT

2014-05-04 Thread Steven Hartland


- Original Message - 
From: "Fabian Keil" 


Thanks for your help testing this, Fabian. I've now committed the fix for
this:
http://svnweb.freebsd.org/changeset/base/265321

   Regards
   Steve
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Fatal double fault in ZFS with yesterday's CURRENT

2014-05-04 Thread Fabian Keil
"Steven Hartland"  wrote:

> > "Steven Hartland"  wrote:
> > 
> > > From: "Fabian Keil" 
> > > 
> > > > After updating my laptop to yesterday's CURRENT (r265216),
> > > > I got the following fatal double fault on boot:
> > > > http://www.fabiankeil.de/bilder/freebsd/kernel-panic-r265216/
> > > > 
> > > > My previous kernel was based on r264721.
> > > >
> > > > I'm using a couple of custom patches, some of them are ZFS-related
> > > > and thus may be part of the problem (but worked fine for months).
> > > > I'll try to reproduce the panic without the patches tomorrow.
> > > >
> > > 
> > > You're seeing a stack overflow in the new ZFS queuing code, which I
> > > believe is being triggered by lack of support for TRIM in one of
> > > your devices, something Xin reported to me yesterday.
> > > 
> > > I committed a fix for failing TRIM requests processing slowly last
> > > night so you could try updating to after r265253 and see if that
> > > helps.
> > 
> > Thanks. The hard disk is indeed unlikely to support TRIM requests,
> > but I can still reproduce the problem with a kernel based on r265255.
> 
> Thanks for testing. I suspect it's still a numbers game with how many items
> are outstanding in the queue, and now that free / TRIM requests are also
> queued it's triggering the failure.
> 
> If you're just on an HDD, try setting the following in /boot/loader.conf as
> a temporary workaround:
> vfs.zfs.trim.enabled=0

That worked, thanks.

> > > I still need to investigate the stack overflow more directly, which
> > > appears to be caused by the new ZFS queuing code when things are
> > > running slowly and there's a large backlog of I/Os.
> > >
> > > I would be interested to know your config there, i.e. zpool layout and
> > > hardware, in the meantime.
> > 
> > The system is a Lenovo ThinkPad R500:
> > http://www.nycbug.org/index.cgi?action=dmesgd&do=view&dmesgid=2449
> > 
> > I'm booting from UFS, the panic occurs while the pool is being imported.
> > 
> > The pool is located on a single geli-encrypted slice:
> > 
> > fk@r500 ~ $zpool status tank
> >   pool: tank
> >  state: ONLINE
> >   scan: scrub repaired 0 in 4h11m with 0 errors on Sat Mar 22 18:25:01 2014
> > config:
> > 
> >     NAME           STATE     READ WRITE CKSUM
> >     tank           ONLINE       0     0     0
> >       ada0s1d.eli  ONLINE       0     0     0
> > 
> > errors: No known data errors
> > 
> > Maybe geli fails TRIM requests differently.
> 
> That helps. Xin also reported the issue with geli, and that's what I'm testing
> with. I believe this is a factor because it significantly slows things down,
> again meaning more items in the queues, but I've only managed to trigger it
> once here as the machine I'm using is pretty quick.

It probably doesn't make a difference, but my system is rather old
and thus I'm still using geli version 3 for ada0s1d.eli while
geli init nowadays defaults to geli version 7.

The system certainly is also slow, though.

Fabian




Re: Fatal double fault in ZFS with yesterday's CURRENT

2014-05-03 Thread Steven Hartland

"Steven Hartland"  wrote:

> From: "Fabian Keil" 
> 
> > After updating my laptop to yesterday's CURRENT (r265216),
> > I got the following fatal double fault on boot:
> > http://www.fabiankeil.de/bilder/freebsd/kernel-panic-r265216/
> > 
> > My previous kernel was based on r264721.
> >
> > I'm using a couple of custom patches, some of them are ZFS-related
> > and thus may be part of the problem (but worked fine for months).
> > I'll try to reproduce the panic without the patches tomorrow.
> >
> 
> You're seeing a stack overflow in the new ZFS queuing code, which I
> believe is being triggered by lack of support for TRIM in one of
> your devices, something Xin reported to me yesterday.
> 
> I committed a fix for failing TRIM requests processing slowly last
> night so you could try updating to after r265253 and see if that
> helps.

Thanks. The hard disk is indeed unlikely to support TRIM requests,
but I can still reproduce the problem with a kernel based on r265255.


Thanks for testing. I suspect it's still a numbers game with how many items
are outstanding in the queue, and now that free / TRIM requests are also
queued it's triggering the failure.

If you're just on an HDD, try setting the following in /boot/loader.conf as
a temporary workaround:
vfs.zfs.trim.enabled=0
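
A minimal sketch of applying this workaround without adding the line twice, assuming a POSIX shell. The helper names has_trim_tunable and disable_trim are made up for illustration; only the tunable name and the /boot/loader.conf path come from the message above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of FreeBSD.

# Succeed if the file already assigns vfs.zfs.trim.enabled (any value).
has_trim_tunable() {
    grep -q '^[[:space:]]*vfs\.zfs\.trim\.enabled=' "$1" 2>/dev/null
}

# Append the workaround unless an assignment is already present.
disable_trim() {
    if has_trim_tunable "$1"; then
        echo "vfs.zfs.trim.enabled already set in $1"
    else
        echo 'vfs.zfs.trim.enabled=0' >> "$1"
        echo "added vfs.zfs.trim.enabled=0 to $1"
    fi
}

# Intended use (as root), followed by a reboot:
#   disable_trim /boot/loader.conf
```

After the reboot, `sysctl vfs.zfs.trim.enabled` should report 0, assuming the kernel exposes the tunable as a read-only sysctl.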


> I still need to investigate the stack overflow more directly, which
> appears to be caused by the new ZFS queuing code when things are
> running slowly and there's a large backlog of I/Os.
>
> I would be interested to know your config there, i.e. zpool layout and
> hardware, in the meantime.

The system is a Lenovo ThinkPad R500:
http://www.nycbug.org/index.cgi?action=dmesgd&do=view&dmesgid=2449

I'm booting from UFS, the panic occurs while the pool is being imported.

The pool is located on a single geli-encrypted slice:

fk@r500 ~ $zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 4h11m with 0 errors on Sat Mar 22 18:25:01 2014
config:

    NAME           STATE     READ WRITE CKSUM
    tank           ONLINE       0     0     0
      ada0s1d.eli  ONLINE       0     0     0

errors: No known data errors

Maybe geli fails TRIM requests differently.


That helps. Xin also reported the issue with geli, and that's what I'm testing
with. I believe this is a factor because it significantly slows things down,
again meaning more items in the queues, but I've only managed to trigger it
once here as the machine I'm using is pretty quick.

I'll continue looking at this ASAP.

   Regards
   Steve


Re: Fatal double fault in ZFS with yesterday's CURRENT

2014-05-03 Thread Fabian Keil
"Steven Hartland"  wrote:

> From: "Fabian Keil" 
> 
> > After updating my laptop to yesterday's CURRENT (r265216),
> > I got the following fatal double fault on boot:
> > http://www.fabiankeil.de/bilder/freebsd/kernel-panic-r265216/
> > 
> > My previous kernel was based on r264721.
> >
> > I'm using a couple of custom patches, some of them are ZFS-related
> > and thus may be part of the problem (but worked fine for months).
> > I'll try to reproduce the panic without the patches tomorrow.
> >
> 
> You're seeing a stack overflow in the new ZFS queuing code, which I
> believe is being triggered by lack of support for TRIM in one of
> your devices, something Xin reported to me yesterday.
> 
> I committed a fix for failing TRIM requests processing slowly last
> night so you could try updating to after r265253 and see if that
> helps.

Thanks. The hard disk is indeed unlikely to support TRIM requests,
but I can still reproduce the problem with a kernel based on r265255.

> I still need to investigate the stack overflow more directly, which
> appears to be caused by the new ZFS queuing code when things are
> running slowly and there's a large backlog of I/Os.
>
> I would be interested to know your config there, i.e. zpool layout and
> hardware, in the meantime.

The system is a Lenovo ThinkPad R500:
http://www.nycbug.org/index.cgi?action=dmesgd&do=view&dmesgid=2449

I'm booting from UFS, the panic occurs while the pool is being imported.

The pool is located on a single geli-encrypted slice:

fk@r500 ~ $zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 4h11m with 0 errors on Sat Mar 22 18:25:01 2014
config:

    NAME           STATE     READ WRITE CKSUM
    tank           ONLINE       0     0     0
      ada0s1d.eli  ONLINE       0     0     0

errors: No known data errors

Maybe geli fails TRIM requests differently.

Fabian




Re: Fatal double fault in ZFS with yesterday's CURRENT

2014-05-03 Thread Steven Hartland
- Original Message - 
From: "Fabian Keil" 



After updating my laptop to yesterday's CURRENT (r265216),
I got the following fatal double fault on boot:
http://www.fabiankeil.de/bilder/freebsd/kernel-panic-r265216/

My previous kernel was based on r264721.

I'm using a couple of custom patches, some of them are ZFS-related
and thus may be part of the problem (but worked fine for months).
I'll try to reproduce the panic without the patches tomorrow.



You're seeing a stack overflow in the new ZFS queuing code, which I
believe is being triggered by lack of support for TRIM in one of
your devices, something Xin reported to me yesterday.

I committed a fix for failing TRIM requests processing slowly last
night so you could try updating to after r265253 and see if that
helps.

I still need to investigate the stack overflow more directly, which
appears to be caused by the new ZFS queuing code when things are
running slowly and there's a large backlog of I/Os.

I would be interested to know your config there, i.e. zpool layout and
hardware, in the meantime.

   Regards
   Steve


Fatal double fault in ZFS with yesterday's CURRENT

2014-05-03 Thread Fabian Keil
After updating my laptop to yesterday's CURRENT (r265216),
I got the following fatal double fault on boot:
http://www.fabiankeil.de/bilder/freebsd/kernel-panic-r265216/

My previous kernel was based on r264721.

I'm using a couple of custom patches, some of them are ZFS-related
and thus may be part of the problem (but worked fine for months).
I'll try to reproduce the panic without the patches tomorrow.

Fabian

