Re: 4.4-rc5 Setting trap flag inside nmi handler results in HARD LOCKUP
On 12/16/15, Jeff Merkey wrote:
> Setting the (trap flag | resume flag) inside of an NMI handler results
> in a hard lockup, while setting only the resume flag works fine.
>
> The watchdog detector fails to detect the lockup. I am currently
> examining the trap gate and interrupt gate setup on Linux, and if
> anyone has any ideas, it would be nice to be able to debug and step
> through the NMI handlers. I got breakpoints to work. I noticed
> kgdb/kdb just punts here and refuses to allow someone to step inside
> an NMI handler.
>
> There is no reason Linux should not allow this to work, since Windows
> and every other OS out there does. I have seen this lockup behavior
> across some rex64 sysret calls as well.
>
> If anyone who is an Intel expert has any clues about this problem, I
> would love some input.
>
> Jeff
>

This bug has been located. It results from returning from an NMI with
the trap flag set into a userspace address, as Andy suspected, but it's
not due to the RSP value being different as he suggested. This is a
separate bug from the rex64 sysret bug. It results in the NMI handler
switching IDT entries if an NMI fires off in a debug stack. Ironic,
since the code claims it is switching stacks to enable debugging of NMI
handlers and does the opposite -- breaks them. Commenting out this code
gets rid of the hard lockup. The userspace process that gets the trap
flag and doesn't expect a trap flag just hangs (but just that process;
the rest of the system keeps running). So a few bugs to run down still.
NMI handlers can now be debugged -- kind of.

This bug is closed and I will issue a patch for it. It's a condition
where a trap flag is set inside an NMI handler that exits to a
userspace address. The code for setting and clearing the trap in the
kernel all worked correctly for the userspace path, except it put the
process to sleep when it shouldn't have.
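For readers following along, the condition described above (trap flag
still set in a frame that returns to userspace) can be expressed as a
small check. This is only a sketch under stated assumptions, not the
patch mentioned in this message: the helper names are hypothetical, and
the only architectural facts used are that TF is bit 8 of EFLAGS and
that the low two bits of the saved CS selector give the privilege level
being returned to (3 means user mode).

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 constants. */
#define EFLAGS_TF   (UINT64_C(1) << 8)  /* trap flag (single-step) */
#define CS_RPL_MASK UINT64_C(3)         /* privilege bits of a selector */

/* Hypothetical helper: true if the saved CS selector belongs to ring 3. */
static inline bool returns_to_user(uint64_t saved_cs)
{
    return (saved_cs & CS_RPL_MASK) == 3;
}

/*
 * Hypothetical helper: true if a handler is about to return to
 * userspace with the trap flag still set in the saved flags image.
 */
static inline bool step_leaks_to_user(uint64_t saved_cs, uint64_t saved_flags)
{
    return returns_to_user(saved_cs) && (saved_flags & EFLAGS_TF) != 0;
}
```

A debugger could call something like step_leaks_to_user() on the saved
frame before leaving the handler and decide whether to clear the trap
flag or report the pending step, rather than letting an unsuspecting
process take the trap.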
It's not a condition that can happen during normal operations unless
you set the trap flag from a debugger inside an NMI handler and try to
debug it, then exit the handler into userspace, so I think the
probability of this showing up outside a debugging session is low. I
verified that kgdb/kdb also experiences this bug (if I comment out the
code blocking folks from debugging NMI handlers with kgdb/kdb).

Jeff

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: 4.4-rc5 Setting trap flag inside nmi handler results in HARD LOCKUP
On 12/16/15, Jeff Merkey wrote:
> On 12/16/15, Jeff Merkey wrote:
>> Setting the (trap flag | resume flag) inside of an NMI handler results
>> in a hard lockup, while setting only the resume flag works fine.
>>
>> The watchdog detector fails to detect the lockup. I am currently
>> examining the trap gate and interrupt gate setup on Linux, and if
>> anyone has any ideas, it would be nice to be able to debug and step
>> through the NMI handlers. I got breakpoints to work. I noticed
>> kgdb/kdb just punts here and refuses to allow someone to step inside
>> an NMI handler.
>>
>> There is no reason Linux should not allow this to work, since Windows
>> and every other OS out there does. I have seen this lockup behavior
>> across some rex64 sysret calls as well.
>>
>> If anyone who is an Intel expert has any clues about this problem, I
>> would love some input.
>>
>> Jeff
>>
>
> More info. Linux is getting a trap and it looks like the IDT is
> getting swapped when it gets it -- POW - Dead Linux.
>
> Damn ... Well, it will be a long night and lots of builds ...
>
> Jeff
>
> He is stepping into native_safe_halt() when this bug occurs (processor
> has halted). I am starting to wonder if this is a Linux bug or an Intel
> bug. I am starting to lean towards an Intel bug, possibly. I will go and
> review Intel's documentation about what happens when a processor has
> been halted, is triggered with an NMI, then someone reloads the
> processor with the trap flag set and then returns to a hlt instruction.
> Wow, this is fucking cool .. I can debug Linus' "I halt when idle" core
> function in the Linux scheduler.

There is still a bug in there and it's not that instruction. I got the
halt instruction to work. It has something to do with how Linux handles
trap and interrupt gates if someone sets a trap flag. It seems pretty
random as to where I see it. Still working. I will submit a patch when
I find it.
Re: 4.4-rc5 Setting trap flag inside nmi handler results in HARD LOCKUP
He is stepping into native_safe_halt() when this bug occurs (processor
has halted). I am starting to wonder if this is a Linux bug or an Intel
bug. I am starting to lean towards an Intel bug, possibly. I will go and
review Intel's documentation about what happens when a processor has
been halted, is triggered with an NMI, then someone reloads the
processor with the trap flag set and then returns to a hlt instruction.
Wow, this is fucking cool .. I can debug Linus' "I halt when idle" core
function in the Linux scheduler.

On 12/16/15, Jeff Merkey wrote:
> On 12/16/15, Jeff Merkey wrote:
>> Setting the (trap flag | resume flag) inside of an NMI handler results
>> in a hard lockup, while setting only the resume flag works fine.
>>
>> The watchdog detector fails to detect the lockup. I am currently
>> examining the trap gate and interrupt gate setup on Linux, and if
>> anyone has any ideas, it would be nice to be able to debug and step
>> through the NMI handlers. I got breakpoints to work. I noticed
>> kgdb/kdb just punts here and refuses to allow someone to step inside
>> an NMI handler.
>>
>> There is no reason Linux should not allow this to work, since Windows
>> and every other OS out there does. I have seen this lockup behavior
>> across some rex64 sysret calls as well.
>>
>> If anyone who is an Intel expert has any clues about this problem, I
>> would love some input.
>>
>> Jeff
>>
>
> More info. Linux is getting a trap and it looks like the IDT is
> getting swapped when it gets it -- POW - Dead Linux.
>
> Damn ... Well, it will be a long night and lots of builds ...
>
> Jeff
>
Re: 4.4-rc5 Setting trap flag inside nmi handler results in HARD LOCKUP
On 12/16/15, Jeff Merkey wrote:
> Setting the (trap flag | resume flag) inside of an NMI handler results
> in a hard lockup, while setting only the resume flag works fine.
>
> The watchdog detector fails to detect the lockup. I am currently
> examining the trap gate and interrupt gate setup on Linux, and if
> anyone has any ideas, it would be nice to be able to debug and step
> through the NMI handlers. I got breakpoints to work. I noticed
> kgdb/kdb just punts here and refuses to allow someone to step inside
> an NMI handler.
>
> There is no reason Linux should not allow this to work, since Windows
> and every other OS out there does. I have seen this lockup behavior
> across some rex64 sysret calls as well.
>
> If anyone who is an Intel expert has any clues about this problem, I
> would love some input.
>
> Jeff
>

More info. Linux is getting a trap and it looks like the IDT is getting
swapped when it gets it -- POW - Dead Linux.

Damn ... Well, it will be a long night and lots of builds ...

Jeff
4.4-rc5 Setting trap flag inside nmi handler results in HARD LOCKUP
Setting the (trap flag | resume flag) inside of an NMI handler results
in a hard lockup, while setting only the resume flag works fine.

The watchdog detector fails to detect the lockup. I am currently
examining the trap gate and interrupt gate setup on Linux, and if
anyone has any ideas, it would be nice to be able to debug and step
through the NMI handlers. I got breakpoints to work. I noticed
kgdb/kdb just punts here and refuses to allow someone to step inside
an NMI handler.

There is no reason Linux should not allow this to work, since Windows
and every other OS out there does. I have seen this lockup behavior
across some rex64 sysret calls as well.

If anyone who is an Intel expert has any clues about this problem, I
would love some input.

Jeff
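To make the two cases in the report concrete, here is a minimal sketch
of what "setting (trap flag | resume flag)" versus "setting the resume
flag" means on the saved flags image a handler returns with. It is an
illustration under assumptions, not code from the kernel or from the
debugger discussed in this thread: struct saved_frame and both helper
names are hypothetical. The bit positions are architectural (TF is bit
8 and RF is bit 16 of EFLAGS/RFLAGS).

```c
#include <stdint.h>

/* Architectural x86 EFLAGS bits. */
#define EFLAGS_TF (UINT64_C(1) << 8)   /* trap flag: single-step */
#define EFLAGS_RF (UINT64_C(1) << 16)  /* resume flag */

/* Hypothetical stand-in for the frame that iret pops on return. */
struct saved_frame {
    uint64_t ip;
    uint64_t cs;
    uint64_t flags;
    uint64_t sp;
    uint64_t ss;
};

/* The case reported to lock up: return with both TF and RF set. */
static inline void frame_request_single_step(struct saved_frame *f)
{
    f->flags |= EFLAGS_TF | EFLAGS_RF;
}

/* The case reported to work: return with RF set and TF clear. */
static inline void frame_request_resume_only(struct saved_frame *f)
{
    f->flags &= ~EFLAGS_TF;
    f->flags |= EFLAGS_RF;
}
```

With TF set in the restored flags, the processor raises a debug
exception (#DB) after the next instruction completes, which is the trap
that the rest of this thread chases through the IDT handling.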