Re: [j-nsp] mx960 crashed

2018-04-05 Thread Jonas Frey
It seems not to be documented by Juniper; at least I couldn't find any
(MX-related) info. However, it's a basic Linux procedure, see:
https://en.wikipedia.org/wiki/Magic_SysRq_key

Juniper only has some info regarding SysRq and the IDP series at:
https://kb.juniper.net/InfoCenter/index?page=content&id=KB6660&actp=METADATA
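
The generic Linux mechanism behind this can be sketched as follows; the bitmask
values come from the kernel's SysRq documentation, and the example mask value
is hypothetical, not taken from the MX:

```shell
# Minimal generic-Linux sketch (not Juniper-specific) of the Magic SysRq
# mechanism. /proc/sys/kernel/sysrq holds a bitmask of which SysRq
# functions are allowed (bit values per the kernel admin-guide/sysrq docs).
sysrq_enabled() {            # usage: sysrq_enabled <mask> <bit>
  [ $(( $1 & $2 )) -ne 0 ]
}

mask=176   # example value: 16 (sync) + 32 (remount-ro) + 128 (reboot/poweroff)
sysrq_enabled "$mask" 128 && echo "reboot/poweroff SysRq allowed"

# Triggering the crash from a root shell is equivalent to break + 'c' on
# the serial console; destructive, so left commented out:
# echo c > /proc/sysrq-trigger
```

On a live box you would read the real mask with `cat /proc/sys/kernel/sysrq`.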

-Jonas
On Thursday, 2018-04-05 at 09:34 -0500, Aaron Gould wrote:
> Thanks Rob. Is a break followed by c within 5 seconds a documented way to
> crash an RE-S-X6-64G?
> 
> Btw, JTAC couldn't find the dump
> 
> RMA'ing the RE ... they said bad SSD on the RE
> 
> -Aaron
> 
> -Original Message-
> From: Rob Foehl [mailto:r...@loonybin.net] 
> Sent: Wednesday, April 4, 2018 2:11 PM
> To: Aaron Gould
> Cc: juniper-nsp@puck.nether.net
> Subject: Re: [j-nsp] mx960 crashed
> 
> On Wed, 4 Apr 2018, Aaron Gould wrote:
> 
> > 
> > Any idea why this happened and how do I troubleshoot the cause?
> > 
> > login: root
> > 
> > Password:SysRq : Trigger a crash
> Looks like you're running a RE-S-X6-64G, and somehow sent it SysRq c --
> which is a break followed by c within 5 seconds on a serial console --
> and the hypervisor dutifully crashed and wrote out a dump.  Can't really
> blame it for doing what it's told.
> 
> -Rob
> 
> ___
> juniper-nsp mailing list juniper-nsp@puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp



Re: [j-nsp] mx960 crashed

2018-04-05 Thread Aaron Gould
Thanks Rob. Is a break followed by c within 5 seconds a documented way to
crash an RE-S-X6-64G?

Btw, JTAC couldn't find the dump

RMA'ing the RE ... they said bad SSD on the RE

-Aaron

-Original Message-
From: Rob Foehl [mailto:r...@loonybin.net] 
Sent: Wednesday, April 4, 2018 2:11 PM
To: Aaron Gould
Cc: juniper-nsp@puck.nether.net
Subject: Re: [j-nsp] mx960 crashed

On Wed, 4 Apr 2018, Aaron Gould wrote:

> Any idea why this happened and how do I troubleshoot the cause?

> login: root
>
> Password:SysRq : Trigger a crash

Looks like you're running a RE-S-X6-64G, and somehow sent it SysRq c --
which is a break followed by c within 5 seconds on a serial console -- and
the hypervisor dutifully crashed and wrote out a dump.  Can't really blame
it for doing what it's told.

-Rob



Re: [j-nsp] mx960 crashed

2018-04-04 Thread Rob Foehl

On Wed, 4 Apr 2018, Aaron Gould wrote:


Any idea why this happened and how do I troubleshoot the cause?



login: root

Password:SysRq : Trigger a crash


Looks like you're running a RE-S-X6-64G, and somehow sent it SysRq c -- 
which is a break followed by c within 5 seconds on a serial console -- and 
the hypervisor dutifully crashed and wrote out a dump.  Can't really blame 
it for doing what it's told.


-Rob


Re: [j-nsp] mx960 crashed

2018-04-04 Thread Aaron Gould
Thanks

 

JTAC says “DMA errors indicate a faulty SSD”; processing RMA of RE1

 

 

agould@mx960> show chassis routing-engine | grep "reason|slot|uptime"

  Slot 0:

Uptime 19 days, 23 hours, 37 minutes, 40 seconds

Last reboot reason 0x4000:VJUNOS reboot

  Slot 1:

Uptime 5 hours, 52 minutes, 22 seconds

Last reboot reason 0x800:reboot due to exception
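
The two reason codes in that output can be mapped back to their text like
this; only these two codes are taken from the thread itself, other Junos
reboot-reason codes exist but are not enumerated here:

```shell
# Sketch: decode the two 'Last reboot reason' codes seen in the
# 'show chassis routing-engine' output above.
decode_reboot_reason() {
  case "$1" in
    0x4000) echo "VJUNOS reboot" ;;
    0x800)  echo "reboot due to exception" ;;
    *)      echo "unknown reason code: $1" ;;
  esac
}

decode_reboot_reason 0x800
```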

 

 

No /dev/mapper/ seen anywhere

 

 

 

RE0

…

  /dev/kmem        Last changed: Mar 15 14:14:49

  /dev/ksyms       Last changed: Mar 15 14:15:16

  /dev/led/        Last changed: Mar 15 14:14:52

  /dev/mch         Last changed: Mar 15 14:23:50

  /dev/md0         Last changed: Mar 15 14:23:50

  /dev/md0.uzip    Last changed: Mar 15 14:14:49

  /dev/md1         Last changed: Mar 15 14:14:53

  /dev/md1.uzip    Last changed: Mar 15 14:14:53

…

 

BU RE1

…

  /dev/ksyms       Last changed: Apr 04 08:08:38

  /dev/led/        Last changed: Apr 04 08:08:16

  /dev/mch         Last changed: Apr 04 08:09:06

  /dev/md0         Last changed: Apr 04 08:09:06

  /dev/md0.uzip    Last changed: Apr 04 08:08:13

…

 

 

From: Graham Brown [mailto:juniper-...@grahambrown.info] 
Sent: Wednesday, April 4, 2018 12:55 PM
To: Aaron Gould
Cc: juniper-nsp@puck.nether.net
Subject: Re: [j-nsp] mx960 crashed

 

Hi Aaron,

 

Looks like you have a core file there; raise a TAC case and they’ll be able
to determine the root cause for you.

Also see if there is anything readable in
/dev/mapper/jvg_P-jlvmrootrw/var/crash/override_reboot_reason that may point
you to the root cause.

 

HTH,

Graham

 

On Thu, 5 Apr 2018 at 01:13, Aaron Gould  wrote:

Still in the process of turning up this new 5-node mx960 100 gig ring; I
went to the backup RE console, went to log in, and saw this happen.



Any idea why this happened and how do I troubleshoot the cause?







FreeBSD/amd64 (mx960) (ttyu0)



login: root

Password:SysRq : Trigger a crash

dmar: DRHD: handling fault status reg 2

dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 151d9000

DMAR:[fault reason 01] Present bit in root entry is clear

dmar: DRHD: handling fault status reg 102

dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 1540c000

DMAR:[fault reason 01] Present bit in root entry is clear

dmar: DRHD: handling fault status reg 202

dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 152fc000

DMAR:[fault reason 01] Present bit in root entry is clear

dmar: DRHD: handling fault status reg 302

dmar: DMAR:[DMA Write] Request device [06:0a.0] fault addr 1ed10d000

DMAR:[fault reason 01] Present bit in root entry is clear

dmar: DRHD: handling fault status reg 402

dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000

DMAR:[fault reason 01] Present bit in root entry is clear

dmar: INTR-REMAP: Request device [[05:00.1] fault index 40

INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear

dmar: DRHD: handling fault status reg 602

dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000

DMAR:[fault reason 01] Present bit in root entry is clear

dmar: DRHD: handling fault status reg 702

dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000

DMAR:[fault reason 01] Present bit in root entry is clear

  4 logical volume(s) in volume group "jvg_S" now active

  4 logical volume(s) in volume group "jvg_P" now active

Override SW Exception reboot reason saved to
/dev/mapper/jvg_P-jlvmrootrw/var/crash/override_reboot_reason

Compressed level 31 vmhost kernel core will be dumped to jvg_P-jlvmrootrw.

Please use crash utility to analyze the core

Copying data   : [ 13 %]

.

.

.

Copying data   : [100 %]



The dumpfile is saved to vmcore-0-compressed-201804041305.



makedumpfile Completed.





(many more boot messages seen)
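
The DMAR fault lines above repeatedly name a PCI device in bus:device.function
form; a quick sketch of pulling that apart so the device can be looked up:

```shell
# Parse the PCI bus:device.function ID from the DMAR fault messages
# (02:10.1 in the DMA Write faults above) using shell parameter expansion.
bdf="02:10.1"
bus=${bdf%%:*}        # everything before the ':'
devfn=${bdf#*:}       # everything after the ':'
dev=${devfn%%.*}      # device number
fn=${devfn##*.}       # function number
echo "bus=$bus device=$dev function=$fn"

# On a live Linux host, 'lspci -s "$bdf"' would then name the device.
```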





-Aaron
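
The dump filename in the log ("vmcore-0-compressed-201804041305") encodes the
crash time as YYYYMMDDhhmm; a quick sanity check that the vmcore matches the
event being investigated, plus a hedged sketch of opening it with the crash
utility the log mentions (the paths are hypothetical):

```shell
# Extract and pretty-print the crash timestamp embedded in the dump name.
dump="vmcore-0-compressed-201804041305"
ts=${dump##*-}                 # strip up to the last '-': 201804041305
echo "crash time: ${ts:0:4}-${ts:4:2}-${ts:6:2} ${ts:8:2}:${ts:10:2}"

# Analyzing the dump afterwards (hypothetical paths; the debug vmlinux
# must match the vmhost kernel that crashed):
# crash /usr/lib/debug/vmlinux /var/crash/$dump
```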


-- 

-sent from my iPhone; please excuse spelling, grammar and brevity-



Re: [j-nsp] mx960 crashed

2018-04-04 Thread Graham Brown
Hi Aaron,

Looks like you have a core file there; raise a TAC case and they’ll be able
to determine the root cause for you.

Also see if there is anything readable in
/dev/mapper/jvg_P-jlvmrootrw/var/crash/override_reboot_reason that may
point you to the root cause.

HTH,
Graham

On Thu, 5 Apr 2018 at 01:13, Aaron Gould  wrote:

> Still in the process of turning up this new 5-node mx960 100 gig ring; I
> went to the backup RE console, went to log in, and saw this happen.
>
>
>
> Any idea why this happened and how do I troubleshoot the cause?
>
>
>
>
>
>
>
> FreeBSD/amd64 (mx960) (ttyu0)
>
>
>
> login: root
>
> Password:SysRq : Trigger a crash
>
> dmar: DRHD: handling fault status reg 2
>
> dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 151d9000
>
> DMAR:[fault reason 01] Present bit in root entry is clear
>
> dmar: DRHD: handling fault status reg 102
>
> dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 1540c000
>
> DMAR:[fault reason 01] Present bit in root entry is clear
>
> dmar: DRHD: handling fault status reg 202
>
> dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 152fc000
>
> DMAR:[fault reason 01] Present bit in root entry is clear
>
> dmar: DRHD: handling fault status reg 302
>
> dmar: DMAR:[DMA Write] Request device [06:0a.0] fault addr 1ed10d000
>
> DMAR:[fault reason 01] Present bit in root entry is clear
>
> dmar: DRHD: handling fault status reg 402
>
> dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000
>
> DMAR:[fault reason 01] Present bit in root entry is clear
>
> dmar: INTR-REMAP: Request device [[05:00.1] fault index 40
>
> INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
>
> dmar: DRHD: handling fault status reg 602
>
> dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000
>
> DMAR:[fault reason 01] Present bit in root entry is clear
>
> dmar: DRHD: handling fault status reg 702
>
> dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000
>
> DMAR:[fault reason 01] Present bit in root entry is clear
>
>   4 logical volume(s) in volume group "jvg_S" now active
>
>   4 logical volume(s) in volume group "jvg_P" now active
>
> Override SW Exception reboot reason saved to
> /dev/mapper/jvg_P-jlvmrootrw/var/crash/override_reboot_reason
>
> Compressed level 31 vmhost kernel core will be dumped to jvg_P-jlvmrootrw.
>
> Please use crash utility to analyze the core
>
> Copying data   : [ 13 %]
>
> .
>
> .
>
> .
>
> Copying data   : [100 %]
>
>
>
> The dumpfile is saved to vmcore-0-compressed-201804041305.
>
>
>
> makedumpfile Completed.
>
>
>
>
>
> (many more boot messages seen)
>
>
>
>
>
> -Aaron
>
>
-- 
-sent from my iPhone; please excuse spelling, grammar and brevity-