Re: [smartos-discuss] random host crashes during high proc count & load / lx zone

2017-08-05 Thread Robert Mustacchi
Hi Daniel,

I suspect you're saying that you're seeing a large number of crashes
that seem related to the change below?

Do you have any crash dumps in /var/crash/volatile that you can share so
we can help debug this?
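For reference, a first pass over such a dump usually looks like this (a sketch; vmcore.14 is the dump number mentioned later in this thread, so adjust to whatever savecore actually wrote into /var/crash/volatile):

```shell
# Feed a short dcmd script to mdb for first-pass triage of a kernel
# crash dump: panic status, panicking stack, register state, and the
# tail of the kernel message buffer.
printf '::status\n::stack\n::panicinfo\n::msgbuf\n' | \
    mdb /var/crash/volatile/vmcore.14
```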

Thanks,
Robert

On 8/5/17 0:59 , Daniel Plominski wrote:
>  
> 
> https://github.com/illumos/illumos-gate/commit/b81db1e8f4fb4ce1e3bf7f8053643f62803cf4fe
> 
>  
> 
> https://us-east.manta.joyent.com/Joyent_Dev/public/builds/smartos/release-20170803-20170803T064301Z/smartos//changelog.txt
> 
> 
>  
> 
> Best regards
> 
>  
> 
>  
> 
> *DANIEL PLOMINSKI*
> 
> Leiter – IT / Head of IT
> 
>  
> 
> Telefon 09265 808-151  |  Mobil 0151 58026316  |  d...@ass.de
> 
> 
> PGP Key: http://pgp.ass.de/2B4EB20A.key
> 
>  
> 
>  
> 
> 
>  
> 
> ASS-Einrichtungssysteme GmbH
> 
> ASS-Adam-Stegner-Straße 19  |  D-96342 Stockheim
> 
>  
> 
> Geschäftsführer: Matthias Stegner, Michael Stegner, Stefan Weiß
> 
> Amtsgericht Coburg HRB 3395  |  Ust-ID: DE218715721
> 
>  
> 
> 
>  
> 
> 



---
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com


Re: [smartos-discuss] random host crashes during high proc count & load / lx zone

2017-07-19 Thread Robert Mustacchi
On 7/19/17 10:55 , Alex Kritikos wrote:
> Hi Robert,
> 
> The output is (truncated to only that NIC)
> 
> LINK     PROPERTY  PERM  VALUE  DEFAULT  POSSIBLE
> sfxge0   mtu       rw    1500   1500     1500
> sfxge1   mtu       rw    1500   1500     1500
> 
> 
> I am afraid I don't have a special relationship with them, but let's
> first figure out what is going on.

OK, from looking at the driver source, I think this is a bug of sorts.
We don't actually report what the MTU range is here.
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/sfxge/sfxge_gld_v3.c#1152
is the relevant source. So that's unfortunate. We should get that fixed.

What happens if you run dladm set-linkprop -t -p mtu=9000 sfxge0?
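Spelled out as a check-set-verify sequence (a sketch; the link name sfxge0 and the column layout are taken from the show-linkprop output above):

```shell
# Show the current MTU property, including the range the driver claims
# to support (the POSSIBLE column).
dladm show-linkprop -p mtu sfxge0

# Temporarily (-t: not persisted across reboot) try to raise the MTU.
dladm set-linkprop -t -p mtu=9000 sfxge0

# Verify it took: pull the VALUE column (field 4 in the layout above),
# skipping the header line.
dladm show-linkprop -p mtu sfxge0 | awk 'NR > 1 { print $4 }'
```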

Robert




Re: [smartos-discuss] random host crashes during high proc count & load / lx zone

2017-07-19 Thread Alex Kritikos
Hi Robert,

The output is (truncated to only that NIC)

LINK     PROPERTY  PERM  VALUE  DEFAULT  POSSIBLE
sfxge0   mtu       rw    1500   1500     1500
sfxge1   mtu       rw    1500   1500     1500


I am afraid I don't have a special relationship with them, but let's
first figure out what is going on.

Alex



On 19 July 2017 at 20:32:54, Robert Mustacchi (r...@joyent.com) wrote:

On 7/19/17 5:03, Alex Kritikos wrote:
> Hello list,
>
> I am running the latest release of SmartOS / Triton and I have a Solarflare
> NIC. While this is now correctly detected, it seems to allow a max MTU of
> 1500. Looking at the latest Solarflare Solaris drivers, it appears that the
> latest version allows a max MTU of 9000. I am trying to set up an overlay
> network for Triton over that NIC, so I am currently stuck.

Hi Alex,

What does dladm show-linkprop -p mtu show for that device? From my read
of the driver source code, the maximum MTU should be 9202 for the
Solarflare cards.

> Is there any plan to update the Solarflare NIC driver? Is it possible to
> do this myself somehow?

In this case, I think we should better understand what's going on. That
said, more generally, we're not in a position to update the driver, as it
was written by Solarflare and, at least at Joyent, we don't have
documentation or specifications to do the work. If you have a
relationship with Solarflare it may be worth talking to them about this,
though we're happy to help them (and other vendors) as we can.

That said, depending on what this is, we can still potentially make
changes without that information.

Robert





Re: [smartos-discuss] random host crashes during high proc count & load / lx zone

2017-07-19 Thread Robert Mustacchi
On 7/19/17 5:03, Alex Kritikos wrote:
> Hello list,
> 
> I am running the latest release of SmartOS / Triton and I have a Solarflare
> NIC. While this is now correctly detected, it seems to allow a max MTU of
> 1500. Looking at the latest Solarflare Solaris drivers, it appears that the
> latest version allows a max MTU of 9000. I am trying to set up an overlay
> network for Triton over that NIC, so I am currently stuck.

Hi Alex,

What does dladm show-linkprop -p mtu show for that device? From my read
of the driver source code, the maximum MTU should be 9202 for the
Solarflare cards.

> Is there any plan to update the Solarflare NIC driver? Is it possible to
> do this myself somehow?

In this case, I think we should better understand what's going on. That
said, more generally, we're not in a position to update the driver, as it
was written by Solarflare and, at least at Joyent, we don't have
documentation or specifications to do the work. If you have a
relationship with Solarflare it may be worth talking to them about this,
though we're happy to help them (and other vendors) as we can.

That said, depending on what this is, we can still potentially make
changes without that information.

Robert




Re: [smartos-discuss] random host crashes during high proc count & load / lx zone

2017-07-19 Thread InterNetX - Juergen Gotteswinter


On 19.07.2017 14:03, Steven Williamson wrote:
> If the crash appears different every time, check for hardware issues?

We are pretty sure that the gm binary from GraphicsMagick causes this
issue; there's nothing else strange on this host.

> 
> Any memory errors or such logged in the OOB management logs, or see if
> anything is reported in "fmdump -ev" ?

Nope, nothing :| already checked.

fmdump does have some entries, though:

[14:39:13][root@imgconvert-vmhost:/var/crash/volatile]$ fmdump -ev
TIME                  CLASS                    ENA
Jul 13 09:40:13.4895  ereport.fm.fmd.module    0x80c1cdc3ff906801
Jul 13 10:58:13.8285  ereport.fm.fmd.module    0xc3e3d9ad20f06801
Jul 13 11:51:52.3469  ereport.fm.fmd.module    0xf2b9c8f7efc06801
Jul 13 12:40:21.9546  ereport.fm.fmd.module    0x1d10efdc5f106801
Jul 13 12:59:29.0118  ereport.fm.fmd.module    0x2ebc3a50d9a06801
Jul 14 17:52:27.2049  ereport.fm.fmd.module    0x16d81be693106801
Jul 14 22:49:44.1068  ereport.fm.fmd.module    0x19e9d25a42506801
Jul 15 14:22:07.9198  ereport.fm.fmd.module    0x487f6cbddb106801
Jul 15 23:01:59.0521  ereport.fm.fmd.module    0x0f666b8d2df06801
Jul 16 13:33:48.8724  ereport.fm.fmd.module    0x07925b1aafb06801
Jul 17 05:40:04.5401  ereport.fm.fmd.module    0x5342638f87106801
Jul 17 13:48:02.1728  ereport.fm.fmd.module    0xfd470998f1e06801
Jul 18 07:29:13.8224  ereport.fm.fmd.module    0x9ce55d18be106801
Jul 18 14:58:38.7127  ereport.fm.fmd.module    0x254ab8d50cf06801
Jul 19 09:11:04.6919  ereport.fm.fmd.module    0xdd8b5e6116506801
Jul 19 12:51:41.4838  ereport.fm.fmd.module    0x35e20f9af5d06801
[14:39:21][root@imgconvert-vmhost:/var/crash/volatile]$
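All of these entries are the same class; on a noisier box it can help to summarize the `fmdump -ev` output by class to see whether anything besides fmd module defects shows up (a sketch; the class is field 4 in the layout above):

```shell
# Count ereports per class: skip the header line, key on the CLASS
# field, and print counts sorted descending.
fmdump -ev | awk 'NR > 1 && $4 != "" { n[$4]++ }
                  END { for (c in n) print n[c], c }' | sort -rn
```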

Jul 13 09:40:13.5637 6deea02a-d655-48d3-c39c-c38a6d1c5158 FMD-8000-2K Diagnosed
  100%  defect.sunos.fmd.module

Problem in: -
   Affects: fmd:///module/software-diagnosis
       FRU: -
  Location: -

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 13 09:40:13 6deea02a-d655-48d3-c39c-c38a6d1c5158  FMD-8000-2K    Minor

Platform    : X10DRi
Chassis_id  : 123456789
Product_sn  :

Fault class : defect.sunos.fmd.module
Affects     : fmd:///module/software-diagnosis
              faulted and taken out of service
FRU         : None
              faulty

Description : An illumos Fault Manager component has experienced an error that
              required the module to be disabled.  Refer to
              http://illumos.org/msg/FMD-8000-2K for more information.

Response    : The module has been disabled.  Events destined for the module
              will be saved for manual diagnosis.

Impact      : Automated diagnosis and response for subsequent events
              associated with this module will not occur.

Action      : Use 'fmdump -v -u <EVENT-ID>' to locate the module.  Use
              'fmadm reset <module>' to reset the module.
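Applied to the listing above, the suggested action would look roughly like this (a sketch; the event ID and module name are taken from the output above, and resetting the diagnosis module won't cure the underlying panics):

```shell
# Locate the module behind the FMD-8000-2K diagnosis.
fmdump -v -u 6deea02a-d655-48d3-c39c-c38a6d1c5158

# The module name can also be scraped from the "Affects" line of
# `fmadm faulty` output (fmd:///module/<name>), then reset.
MODULE=$(fmadm faulty | sed -n 's|.*fmd:///module/||p' | head -1)
fmadm reset "$MODULE"
```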


> 
> Worth checking before digging into the dumps.
> 
> On Wed, 19 Jul 2017 at 13:00 Jerry Jelinek wrote:
> 
> I see that you are running a recent platform build
> (joyent_20170710T035256Z) so this does not appear to be a known bug.
> I filed OS-6238 to track this issue. We will want to get a copy of
> your dump so that we can fully debug this. Please contact me
> directly so we can arrange that.
> 
> Thanks for reporting this and sorry for the problem,
> Jerry
> 
> 
> On Wed, Jul 19, 2017 at 2:46 AM, InterNetX - Juergen Gotteswinter
> <j...@internetx.com> wrote:
> 
> Hello List,
> 
> we are facing an issue with GraphicsMagick convert jobs inside a
> CentOS 7 LX branded zone.
> 
> Inside the zone, tons of pictures get converted via GraphicsMagick in
> batch jobs (proc count ~80 usually). Every few hours, the whole system
> panics; as far as my mdb skills tell me, it's not always the same
> reason.
> 
> Maybe someone can take a look at this; currently we are somewhat at a
> dead end. Full core dump files can be supplied if needed.
> 
> Thanks!
> 
> Juergen
> 
> > ::status
> debugging crash dump vmcore.14 (64-bit) from
> operating system: 5.11 joyent_20170710T035256Z (i86pc)
> image uuid: (not set)
> panic message: mutex_enter: bad mutex, lp=d0198b416018
> owner=d01989e9a8c0 thread=d01985f477c0
> dump content: kernel pages only
> >
> 
> 
> > ::stack
> vpanic()
> mutex_panic+0x58(fb952692, d0198b416018)
> mutex_vector_enter+0x347(d0198b416018)
> priv_proc_cred_perm+0x48(d01971187008, d0198b416000, 0, 40)
> lxpr_doaccess+0xe0(d0196fa06378, 0, 40, 0, d01971187008, 0)
> lxpr_access+0x31(d0196cf5d200, 40, 0, d01971187008, 0)
> lxpr_lookup+0x59(d0196cf

Re: [smartos-discuss] random host crashes during high proc count & load / lx zone

2017-07-19 Thread Alex Kritikos
Hello list,

I am running the latest release of SmartOS / Triton and I have a Solarflare
NIC. While this is now correctly detected, it seems to allow a max MTU of
1500. Looking at the latest Solarflare Solaris drivers, it appears that the
latest version allows a max MTU of 9000. I am trying to set up an overlay
network for Triton over that NIC, so I am currently stuck.

Is there any plan to update the Solarflare NIC driver? Is it possible to
do this myself somehow?

Many thanks in advance,

Alex Kritikos





Re: [smartos-discuss] random host crashes during high proc count & load / lx zone

2017-07-19 Thread Steven Williamson
If the crash appears different every time, check for hardware issues?

Any memory errors or such logged in the OOB management logs, or see if
anything is reported in "fmdump -ev" ?

Worth checking before digging into the dumps.

On Wed, 19 Jul 2017 at 13:00 Jerry Jelinek  wrote:

> I see that you are running a recent platform build (joyent_20170710T035256Z)
> so this does not appear to be a known bug. I filed OS-6238 to track this
> issue. We will want to get a copy of your dump so that we can fully debug
> this. Please contact me directly so we can arrange that.
>
> Thanks for reporting this and sorry for the problem,
> Jerry
>
>
> On Wed, Jul 19, 2017 at 2:46 AM, InterNetX - Juergen Gotteswinter <
> j...@internetx.com> wrote:
>
>> Hello List,
>>
>> we are facing an issue with GraphicsMagick convert jobs inside a CentOS
>> 7 LX branded zone.
>>
>> Inside the zone, tons of pictures get converted via GraphicsMagick in
>> batch jobs (proc count ~80 usually). Every few hours, the whole system
>> panics; as far as my mdb skills tell me, it's not always the same reason.
>>
>> Maybe someone can take a look at this; currently we are somewhat at a
>> dead end. Full core dump files can be supplied if needed.
>>
>> Thanks!
>>
>> Juergen
>>
>> > ::status
>> debugging crash dump vmcore.14 (64-bit) from
>> operating system: 5.11 joyent_20170710T035256Z (i86pc)
>> image uuid: (not set)
>> panic message: mutex_enter: bad mutex, lp=d0198b416018
>> owner=d01989e9a8c0 thread=d01985f477c0
>> dump content: kernel pages only
>> >
>>
>>
>> > ::stack
>> vpanic()
>> mutex_panic+0x58(fb952692, d0198b416018)
>> mutex_vector_enter+0x347(d0198b416018)
>> priv_proc_cred_perm+0x48(d01971187008, d0198b416000, 0, 40)
>> lxpr_doaccess+0xe0(d0196fa06378, 0, 40, 0, d01971187008, 0)
>> lxpr_access+0x31(d0196cf5d200, 40, 0, d01971187008, 0)
>> lxpr_lookup+0x59(d0196cf5d200, d00080f92980, d00080f92978,
>> d00080f92bd0, 0, d019695dcc80)
>> fop_lookup+0xa3(d0196cf5d200, d00080f92980, d00080f92978,
>> d00080f92bd0, 0, d019695dcc80)
>> lookuppnvp+0x230(d00080f92bd0, 0, 0, 0, d00080f92de0,
>> d019695dcc80)
>> lookuppnatcred+0x176(d00080f92bd0, 0, 0, 0, d00080f92de0, 0)
>> lookupnameatcred+0xdd(7feff730, 0, 0, 0, d00080f92de0, 0)
>> lookupnameat+0x39(7feff730, 0, 0, 0, d00080f92de0, 0)
>> readlinkat+0x9e(ffd19553, 7feff730, 7f012440, 7f)
>> lx_readlink+0x2c(7feff730, 7f012440, 7f)
>> lx_syscall_enter+0x16f()
>> sys_syscall+0x142()
>> >
>>
>> ::msgbuf
>>
>> 
>> 
>>
>> bpf0 is /pseudo/bpf@0
>> pseudo-device: pm0
>> pm0 is /pseudo/pm@0
>> pseudo-device: nsmb0
>> nsmb0 is /pseudo/nsmb@0
>> pseudo-device: lx_systrace0
>> lx_systrace0 is /pseudo/lx_systrace@0
>> NOTICE: vnic1011 unregistered
>>
>> panic[cpu23]/thread=d01985f477c0:
>> mutex_enter: bad mutex, lp=d0198b416018 owner=d01989e9a8c0
>> thread=d01985f477c0
>>
>>
>> d00080f925a0 unix:mutex_panic+58 ()
>> d00080f92610 unix:mutex_vector_enter+347 ()
>> d00080f92680 genunix:priv_proc_cred_perm+48 ()
>> d00080f92710 lx_proc:lxpr_doaccess+e0 ()
>> d00080f92750 lx_proc:lxpr_access+31 ()
>> d00080f927d0 lx_proc:lxpr_lookup+59 ()
>> d00080f92880 genunix:fop_lookup+a3 ()
>> d00080f92af0 genunix:lookuppnvp+230 ()
>> d00080f92b90 genunix:lookuppnatcred+176 ()
>> d00080f92ca0 genunix:lookupnameatcred+dd ()
>> d00080f92cf0 genunix:lookupnameat+39 ()
>> d00080f92e40 genunix:readlinkat+9e ()
>> d00080f92e70 lx_brand:lx_readlink+2c ()
>> d00080f92ef0 lx_brand:lx_syscall_enter+16f ()
>> d00080f92f10 unix:brand_sys_syscall+1bd ()
>>
>>
>> > ::panicinfo
>>  cpu   23
>>   thread d01985f477c0
>>  message mutex_enter: bad mutex, lp=d0198b416018
>> owner=d01989e9a8c0 thread=d01985f477c0
>>  rdi fb95265f
>>  rsi d00080f92520
>>  rdx d0198b416018
>>  rcx d01989e9a8c0
>>   r8 d01985f477c0
>>   r9 d00080f925a0
>>  rax d00080f92540
>>  rbx d0198b416018
>>  rbp d00080f92580
>>  r10 d01986eb87a0
>>  r11 d01985f477c0
>>  r120
>>  r130
>>  r14 d0198b416000
>>  r15   40
>>   fsbase0
>>   gsbase d01947944580
>>   ds   38
>>   es0
>>   fs0
>>   gs0
>>   trapno0
>>  err0
>>  rip fb863660
>>   cs   30
>>   rflags  282
>>  rsp d00080f92518
>>   ss   38
>>   gdt_hi0
>> 

Re: [smartos-discuss] random host crashes during high proc count & load / lx zone

2017-07-19 Thread Jerry Jelinek
I see that you are running a recent platform build (joyent_20170710T035256Z)
so this does not appear to be a known bug. I filed OS-6238 to track this
issue. We will want to get a copy of your dump so that we can fully debug
this. Please contact me directly so we can arrange that.

Thanks for reporting this and sorry for the problem,
Jerry


On Wed, Jul 19, 2017 at 2:46 AM, InterNetX - Juergen Gotteswinter <
j...@internetx.com> wrote:

> Hello List,
>
> we are facing an issue with GraphicsMagick convert jobs inside a CentOS
> 7 LX branded zone.
>
> Inside the zone, tons of pictures get converted via GraphicsMagick in
> batch jobs (proc count ~80 usually). Every few hours, the whole system
> panics; as far as my mdb skills tell me, it's not always the same reason.
>
> Maybe someone can take a look at this; currently we are somewhat at a
> dead end. Full core dump files can be supplied if needed.
>
> Thanks!
>
> Juergen
>
> > ::status
> debugging crash dump vmcore.14 (64-bit) from
> operating system: 5.11 joyent_20170710T035256Z (i86pc)
> image uuid: (not set)
> panic message: mutex_enter: bad mutex, lp=d0198b416018
> owner=d01989e9a8c0 thread=d01985f477c0
> dump content: kernel pages only
> >
>
>
> > ::stack
> vpanic()
> mutex_panic+0x58(fb952692, d0198b416018)
> mutex_vector_enter+0x347(d0198b416018)
> priv_proc_cred_perm+0x48(d01971187008, d0198b416000, 0, 40)
> lxpr_doaccess+0xe0(d0196fa06378, 0, 40, 0, d01971187008, 0)
> lxpr_access+0x31(d0196cf5d200, 40, 0, d01971187008, 0)
> lxpr_lookup+0x59(d0196cf5d200, d00080f92980, d00080f92978,
> d00080f92bd0, 0, d019695dcc80)
> fop_lookup+0xa3(d0196cf5d200, d00080f92980, d00080f92978,
> d00080f92bd0, 0, d019695dcc80)
> lookuppnvp+0x230(d00080f92bd0, 0, 0, 0, d00080f92de0,
> d019695dcc80)
> lookuppnatcred+0x176(d00080f92bd0, 0, 0, 0, d00080f92de0, 0)
> lookupnameatcred+0xdd(7feff730, 0, 0, 0, d00080f92de0, 0)
> lookupnameat+0x39(7feff730, 0, 0, 0, d00080f92de0, 0)
> readlinkat+0x9e(ffd19553, 7feff730, 7f012440, 7f)
> lx_readlink+0x2c(7feff730, 7f012440, 7f)
> lx_syscall_enter+0x16f()
> sys_syscall+0x142()
> >
>
> ::msgbuf
>
> 
> 
>
> bpf0 is /pseudo/bpf@0
> pseudo-device: pm0
> pm0 is /pseudo/pm@0
> pseudo-device: nsmb0
> nsmb0 is /pseudo/nsmb@0
> pseudo-device: lx_systrace0
> lx_systrace0 is /pseudo/lx_systrace@0
> NOTICE: vnic1011 unregistered
>
> panic[cpu23]/thread=d01985f477c0:
> mutex_enter: bad mutex, lp=d0198b416018 owner=d01989e9a8c0
> thread=d01985f477c0
>
>
> d00080f925a0 unix:mutex_panic+58 ()
> d00080f92610 unix:mutex_vector_enter+347 ()
> d00080f92680 genunix:priv_proc_cred_perm+48 ()
> d00080f92710 lx_proc:lxpr_doaccess+e0 ()
> d00080f92750 lx_proc:lxpr_access+31 ()
> d00080f927d0 lx_proc:lxpr_lookup+59 ()
> d00080f92880 genunix:fop_lookup+a3 ()
> d00080f92af0 genunix:lookuppnvp+230 ()
> d00080f92b90 genunix:lookuppnatcred+176 ()
> d00080f92ca0 genunix:lookupnameatcred+dd ()
> d00080f92cf0 genunix:lookupnameat+39 ()
> d00080f92e40 genunix:readlinkat+9e ()
> d00080f92e70 lx_brand:lx_readlink+2c ()
> d00080f92ef0 lx_brand:lx_syscall_enter+16f ()
> d00080f92f10 unix:brand_sys_syscall+1bd ()
>
>
> > ::panicinfo
>  cpu   23
>   thread d01985f477c0
>  message mutex_enter: bad mutex, lp=d0198b416018
> owner=d01989e9a8c0 thread=d01985f477c0
>  rdi fb95265f
>  rsi d00080f92520
>  rdx d0198b416018
>  rcx d01989e9a8c0
>   r8 d01985f477c0
>   r9 d00080f925a0
>  rax d00080f92540
>  rbx d0198b416018
>  rbp d00080f92580
>  r10 d01986eb87a0
>  r11 d01985f477c0
>  r120
>  r130
>  r14 d0198b416000
>  r15   40
>   fsbase0
>   gsbase d01947944580
>   ds   38
>   es0
>   fs0
>   gs0
>   trapno0
>  err0
>  rip fb863660
>   cs   30
>   rflags  282
>  rsp d00080f92518
>   ss   38
>   gdt_hi0
>   gdt_lo b1ef
>   idt_hi0
>   idt_lo afff
>  ldt0
> task   70
>  cr0 8005003b
>  cr2 7fff84009000
>  cr36203d1000
>  cr4   3426f8
> >
>
>
>
>
> > ::memstat
>
>
> Page SummaryPagesMB  %Tot
> --