Re: [GSoC - x86_64] Pre-merge issues (at -O2 optimization level) and WIP review
On Tue, Aug 14, 2018 at 7:34 PM, Joel Sherrill wrote: > > > On Sun, Aug 12, 2018 at 4:47 PM, Amaan Cheval > wrote: >> >> Hi! >> >> I've narrowed the issue down to this bintime function: >> >> https://github.com/RTEMS/rtems/blob/b2de4260c5c71e518742731a8cdebe3411937181/cpukit/score/src/kern_tc.c#L548 >> >> The watchdog ticks in _Per_CPU_Information / Clock_driver_ticks are at >> "1000", when that function is called (rtems_clock_get_tod -> >> _TOD_Get_timeval -> _Timecounter_Microtime -> microtime). The bt and >> tvp values there are: >> >> (gdb) p bt >> $2 = {sec = 599562004, frac = 18446744073709551536} >> (gdb) p *tvp >> $3 = {tv_sec = 599562004, tv_usec = 99} >> >> The full (relevant) debug log for the "wrong" timing despite the >> Clock_driver_ticks being correct is here: >> https://gist.github.com/AmaanC/c59caf5232b03054d457dcacb5ab1c54 >> >> I'm quite unfamiliar with how the low-level internals work and it >> looks like it comes from FreeBSD. This is likely a bug from the >> timecounter being "too" precise - it dispatches the task at _exactly_ >> the tc_freq it promised - if it slips by 1 tick, then the values start >> looking correct. >> >> This looks more like an off-by-one in the low-level code, in that >> case, since my clock driver's timecounter returns exactly the value it >> ought to be returning (100 when 1 second has passed, for eg., when the >> tc_frequency=100Hz - in that case the bintime's returned "now.tv_sec" >> value in clockgettod.c causes the wrong second to be set in >> "time_buffer"). >> >> >> https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c#L51 >> >> On Sun, Aug 12, 2018 at 2:48 PM, Amaan Cheval >> wrote: >> > There's another issue I'm having now: >> > >> > At -O0, ticker.exe works well and passes reliably. At -O2, the TOD >> > seems to be rushed a bit: >> > >> > TA1 - rtems_clock_get_tod - 09:00:00 12/31/1988 >> > TA2 - rtems_clock_get_tod - 09:00:00 12/31/1988 >> > TA3 - rtems_clock_get_tod - 09:00:00 12/31/1988 >> > TA1 - rtems_clock_get_tod - 09:00:04 12/31/1988 >> > TA2 - rtems_clock_get_tod - 09:00:09 12/31/1988 >> > TA1 - rtems_clock_get_tod - 09:00:09 12/31/1988 >> > TA3 - rtems_clock_get_tod - 09:00:14 12/31/1988 >> > TA1 - rtems_clock_get_tod - 09:00:14 12/31/1988 >> > TA2 - rtems_clock_get_tod - 09:00:19 12/31/1988 >> > TA1 - rtems_clock_get_tod - 09:00:19 12/31/1988 >> > TA1 - rtems_clock_get_tod - 09:00:24 12/31/1988 >> > TA3 - rtems_clock_get_tod - 09:00:29 12/31/1988 >> > TA2 - rtems_clock_get_tod - 09:00:29 12/31/1988 >> > TA1 - rtems_clock_get_tod - 09:00:29 12/31/1988 >> > TA1 - rtems_clock_get_tod - 09:00:34 12/31/1988 >> > >> > I'm not sure what it could be - I suspected my get_timecount somehow >> > not realizing that Clock_driver_ticks was volatile, but that seems to >> > be in order. The relevant code is here: >> > >> > https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c >> > >> > On Sun, Aug 12, 2018 at 3:43 AM, Amaan Cheval >> > wrote: >> >> Figured it out; turns out my code to align the stack so I could make >> >> calls without raising exceptions was messing up and corrupting the >> >> stack-pointer. >> >> >> >> Running the -O2 code now makes the clock run a bit too quickly - the >> >> calibration may have a minor issue. I'll fix that up and send patches >> >> tomorrow or Monday hopefully. >> >> >> >> I'll be traveling Tuesday, so I'd appreciate if we can get them merged >> >> upstream Monday itself - I'm okay to have a call and walk someone >> >> through the patches and whatnot if need be. >> >> >> >> Cheers! >> >> >> >> On Sun, Aug 12, 2018 at 1:25 AM, Amaan Cheval >> >> wrote: >> >>> Hi! >> >>> >> >>> In the process of cleaning my work up, I've run into an odd problem >> >>> which only shows up when I set the optimization level to -O2. At -O0, >> >>> it's perfectly fine. >> >>> >> >>> The issue is that somehow execution ends up at address 0x0. >> >>> >> >>> This likely happens due to a _CPU_Context_switch, where the rsp is set >> >>> to a corrupted value, leading to a corrupt (i.e. 0) return address at >> >>> the end of the context switch. >> >>> >> >>> What's curious is that this corruption _seems_ to occur in >> >>> _ISR_Handler's call to _Thread_Dispatch, by somehow messing the value >> >>> of rsp up - I honestly don't know this for sure because gdb says one >> >>> thing (i.e. that rsp = 0), but setting up some code (cmpq $0, rsp) to >> >>> check this seems to say rsp is non-zero, at least. >> >>> >> >>> This is an odd heisenbug I'd like to investigate for sure - I just >> >>> thought I'd shoot this email out because: >> >>> >> >>> - If I can't figure it out tomorrow, soon, I'll just drop it so I can >> >>> create more logical commits to send as patches upstream (thereby >> >>> leaving -O0 upstream, at least temporarily) >> >>> >> >>> - If anyone's seen an odd stack corruption like this, or has any >>
Re: [GSoC - x86_64] Pre-merge issues (at -O2 optimization level) and WIP review
On Sun, Aug 12, 2018 at 4:47 PM, Amaan Cheval wrote: > Hi! > > I've narrowed the issue down to this bintime function: > https://github.com/RTEMS/rtems/blob/b2de4260c5c71e518742731a8cdebe > 3411937181/cpukit/score/src/kern_tc.c#L548 > > The watchdog ticks in _Per_CPU_Information / Clock_driver_ticks are at > "1000", when that function is called (rtems_clock_get_tod -> > _TOD_Get_timeval -> _Timecounter_Microtime -> microtime). The bt and > tvp values there are: > > (gdb) p bt > $2 = {sec = 599562004, frac = 18446744073709551536} > (gdb) p *tvp > $3 = {tv_sec = 599562004, tv_usec = 99} > > The full (relevant) debug log for the "wrong" timing despite the > Clock_driver_ticks being correct is here: > https://gist.github.com/AmaanC/c59caf5232b03054d457dcacb5ab1c54 > > I'm quite unfamiliar with how the low-level internals work and it > looks like it comes from FreeBSD. This is likely a bug from the > timecounter being "too" precise - it dispatches the task at _exactly_ > the tc_freq it promised - if it slips by 1 tick, then the values start > looking correct. > > This looks more like an off-by-one in the low-level code, in that > case, since my clock driver's timecounter returns exactly the value it > ought to be returning (100 when 1 second has passed, for eg., when the > tc_frequency=100Hz - in that case the bintime's returned "now.tv_sec" > value in clockgettod.c causes the wrong second to be set in > "time_buffer"). > > https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03- > post-hello/bsps/x86_64/amd64/clock/clock.c#L51 > > On Sun, Aug 12, 2018 at 2:48 PM, Amaan Cheval > wrote: > > There's another issue I'm having now: > > > > At -O0, ticker.exe works well and passes reliably. At -O2, the TOD > > seems to be rushed a bit: > > > > TA1 - rtems_clock_get_tod - 09:00:00 12/31/1988 > > TA2 - rtems_clock_get_tod - 09:00:00 12/31/1988 > > TA3 - rtems_clock_get_tod - 09:00:00 12/31/1988 > > TA1 - rtems_clock_get_tod - 09:00:04 12/31/1988 > > TA2 - rtems_clock_get_tod - 09:00:09 12/31/1988 > > TA1 - rtems_clock_get_tod - 09:00:09 12/31/1988 > > TA3 - rtems_clock_get_tod - 09:00:14 12/31/1988 > > TA1 - rtems_clock_get_tod - 09:00:14 12/31/1988 > > TA2 - rtems_clock_get_tod - 09:00:19 12/31/1988 > > TA1 - rtems_clock_get_tod - 09:00:19 12/31/1988 > > TA1 - rtems_clock_get_tod - 09:00:24 12/31/1988 > > TA3 - rtems_clock_get_tod - 09:00:29 12/31/1988 > > TA2 - rtems_clock_get_tod - 09:00:29 12/31/1988 > > TA1 - rtems_clock_get_tod - 09:00:29 12/31/1988 > > TA1 - rtems_clock_get_tod - 09:00:34 12/31/1988 > > > > I'm not sure what it could be - I suspected my get_timecount somehow > > not realizing that Clock_driver_ticks was volatile, but that seems to > > be in order. The relevant code is here: > > https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03- > post-hello/bsps/x86_64/amd64/clock/clock.c > > > > On Sun, Aug 12, 2018 at 3:43 AM, Amaan Cheval > wrote: > >> Figured it out; turns out my code to align the stack so I could make > >> calls without raising exceptions was messing up and corrupting the > >> stack-pointer. > >> > >> Running the -O2 code now makes the clock run a bit too quickly - the > >> calibration may have a minor issue. I'll fix that up and send patches > >> tomorrow or Monday hopefully. > >> > >> I'll be traveling Tuesday, so I'd appreciate if we can get them merged > >> upstream Monday itself - I'm okay to have a call and walk someone > >> through the patches and whatnot if need be. > >> > >> Cheers! > >> > >> On Sun, Aug 12, 2018 at 1:25 AM, Amaan Cheval > wrote: > >>> Hi! > >>> > >>> In the process of cleaning my work up, I've run into an odd problem > >>> which only shows up when I set the optimization level to -O2. At -O0, > >>> it's perfectly fine. > >>> > >>> The issue is that somehow execution ends up at address 0x0. > >>> > >>> This likely happens due to a _CPU_Context_switch, where the rsp is set > >>> to a corrupted value, leading to a corrupt (i.e. 0) return address at > >>> the end of the context switch. > >>> > >>> What's curious is that this corruption _seems_ to occur in > >>> _ISR_Handler's call to _Thread_Dispatch, by somehow messing the value > >>> of rsp up - I honestly don't know this for sure because gdb says one > >>> thing (i.e. that rsp = 0), but setting up some code (cmpq $0, rsp) to > >>> check this seems to say rsp is non-zero, at least. > >>> > >>> This is an odd heisenbug I'd like to investigate for sure - I just > >>> thought I'd shoot this email out because: > >>> > >>> - If I can't figure it out tomorrow, soon, I'll just drop it so I can > >>> create more logical commits to send as patches upstream (thereby > >>> leaving -O0 upstream, at least temporarily) > >>> > >>> - If anyone's seen an odd stack corruption like this, or has any > >>> advice on debugging it, could you let me know? I suspect something > >>> like interrupting tasks which ought not to be interrupted (perhaps I > >>> forgot to implement some
Re: [GSoC - x86_64] Pre-merge issues (at -O2 optimization level) and WIP review
Hi! I've narrowed the issue down to this bintime function: https://github.com/RTEMS/rtems/blob/b2de4260c5c71e518742731a8cdebe3411937181/cpukit/score/src/kern_tc.c#L548 The watchdog ticks in _Per_CPU_Information / Clock_driver_ticks are at "1000", when that function is called (rtems_clock_get_tod -> _TOD_Get_timeval -> _Timecounter_Microtime -> microtime). The bt and tvp values there are: (gdb) p bt $2 = {sec = 599562004, frac = 18446744073709551536} (gdb) p *tvp $3 = {tv_sec = 599562004, tv_usec = 99} The full (relevant) debug log for the "wrong" timing despite the Clock_driver_ticks being correct is here: https://gist.github.com/AmaanC/c59caf5232b03054d457dcacb5ab1c54 I'm quite unfamiliar with how the low-level internals work and it looks like it comes from FreeBSD. This is likely a bug from the timecounter being "too" precise - it dispatches the task at _exactly_ the tc_freq it promised - if it slips by 1 tick, then the values start looking correct. This looks more like an off-by-one in the low-level code, in that case, since my clock driver's timecounter returns exactly the value it ought to be returning (100 when 1 second has passed, for eg., when the tc_frequency=100Hz - in that case the bintime's returned "now.tv_sec" value in clockgettod.c causes the wrong second to be set in "time_buffer"). https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c#L51 On Sun, Aug 12, 2018 at 2:48 PM, Amaan Cheval wrote: > There's another issue I'm having now: > > At -O0, ticker.exe works well and passes reliably. At -O2, the TOD > seems to be rushed a bit: > > TA1 - rtems_clock_get_tod - 09:00:00 12/31/1988 > TA2 - rtems_clock_get_tod - 09:00:00 12/31/1988 > TA3 - rtems_clock_get_tod - 09:00:00 12/31/1988 > TA1 - rtems_clock_get_tod - 09:00:04 12/31/1988 > TA2 - rtems_clock_get_tod - 09:00:09 12/31/1988 > TA1 - rtems_clock_get_tod - 09:00:09 12/31/1988 > TA3 - rtems_clock_get_tod - 09:00:14 12/31/1988 > TA1 - rtems_clock_get_tod - 09:00:14 12/31/1988 > TA2 - rtems_clock_get_tod - 09:00:19 12/31/1988 > TA1 - rtems_clock_get_tod - 09:00:19 12/31/1988 > TA1 - rtems_clock_get_tod - 09:00:24 12/31/1988 > TA3 - rtems_clock_get_tod - 09:00:29 12/31/1988 > TA2 - rtems_clock_get_tod - 09:00:29 12/31/1988 > TA1 - rtems_clock_get_tod - 09:00:29 12/31/1988 > TA1 - rtems_clock_get_tod - 09:00:34 12/31/1988 > > I'm not sure what it could be - I suspected my get_timecount somehow > not realizing that Clock_driver_ticks was volatile, but that seems to > be in order. The relevant code is here: > https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c > > On Sun, Aug 12, 2018 at 3:43 AM, Amaan Cheval wrote: >> Figured it out; turns out my code to align the stack so I could make >> calls without raising exceptions was messing up and corrupting the >> stack-pointer. >> >> Running the -O2 code now makes the clock run a bit too quickly - the >> calibration may have a minor issue. I'll fix that up and send patches >> tomorrow or Monday hopefully. >> >> I'll be traveling Tuesday, so I'd appreciate if we can get them merged >> upstream Monday itself - I'm okay to have a call and walk someone >> through the patches and whatnot if need be. >> >> Cheers! >> >> On Sun, Aug 12, 2018 at 1:25 AM, Amaan Cheval wrote: >>> Hi! >>> >>> In the process of cleaning my work up, I've run into an odd problem >>> which only shows up when I set the optimization level to -O2. At -O0, >>> it's perfectly fine. >>> >>> The issue is that somehow execution ends up at address 0x0. >>> >>> This likely happens due to a _CPU_Context_switch, where the rsp is set >>> to a corrupted value, leading to a corrupt (i.e. 0) return address at >>> the end of the context switch. >>> >>> What's curious is that this corruption _seems_ to occur in >>> _ISR_Handler's call to _Thread_Dispatch, by somehow messing the value >>> of rsp up - I honestly don't know this for sure because gdb says one >>> thing (i.e. that rsp = 0), but setting up some code (cmpq $0, rsp) to >>> check this seems to say rsp is non-zero, at least. >>> >>> This is an odd heisenbug I'd like to investigate for sure - I just >>> thought I'd shoot this email out because: >>> >>> - If I can't figure it out tomorrow, soon, I'll just drop it so I can >>> create more logical commits to send as patches upstream (thereby >>> leaving -O0 upstream, at least temporarily) >>> >>> - If anyone's seen an odd stack corruption like this, or has any >>> advice on debugging it, could you let me know? I suspect something >>> like interrupting tasks which ought not to be interrupted (perhaps I >>> forgot to implement some kind of "CPU_ISR_Disable") - is there >>> anything you can think of of that sort? >>> >>> Also, here's a Github PR like last time with all the work (just for >>> the overall changes, not the specific commits!). I'd appreciate a >>> quick review if anyone could - sorry about
Re: [GSoC - x86_64] Pre-merge issues (at -O2 optimization level) and WIP review
There's another issue I'm having now: At -O0, ticker.exe works well and passes reliably. At -O2, the TOD seems to be rushed a bit: TA1 - rtems_clock_get_tod - 09:00:00 12/31/1988 TA2 - rtems_clock_get_tod - 09:00:00 12/31/1988 TA3 - rtems_clock_get_tod - 09:00:00 12/31/1988 TA1 - rtems_clock_get_tod - 09:00:04 12/31/1988 TA2 - rtems_clock_get_tod - 09:00:09 12/31/1988 TA1 - rtems_clock_get_tod - 09:00:09 12/31/1988 TA3 - rtems_clock_get_tod - 09:00:14 12/31/1988 TA1 - rtems_clock_get_tod - 09:00:14 12/31/1988 TA2 - rtems_clock_get_tod - 09:00:19 12/31/1988 TA1 - rtems_clock_get_tod - 09:00:19 12/31/1988 TA1 - rtems_clock_get_tod - 09:00:24 12/31/1988 TA3 - rtems_clock_get_tod - 09:00:29 12/31/1988 TA2 - rtems_clock_get_tod - 09:00:29 12/31/1988 TA1 - rtems_clock_get_tod - 09:00:29 12/31/1988 TA1 - rtems_clock_get_tod - 09:00:34 12/31/1988 I'm not sure what it could be - I suspected my get_timecount somehow not realizing that Clock_driver_ticks was volatile, but that seems to be in order. The relevant code is here: https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c On Sun, Aug 12, 2018 at 3:43 AM, Amaan Cheval wrote: > Figured it out; turns out my code to align the stack so I could make > calls without raising exceptions was messing up and corrupting the > stack-pointer. > > Running the -O2 code now makes the clock run a bit too quickly - the > calibration may have a minor issue. I'll fix that up and send patches > tomorrow or Monday hopefully. > > I'll be traveling Tuesday, so I'd appreciate if we can get them merged > upstream Monday itself - I'm okay to have a call and walk someone > through the patches and whatnot if need be. > > Cheers! > > On Sun, Aug 12, 2018 at 1:25 AM, Amaan Cheval wrote: >> Hi! >> >> In the process of cleaning my work up, I've run into an odd problem >> which only shows up when I set the optimization level to -O2. At -O0, >> it's perfectly fine. >> >> The issue is that somehow execution ends up at address 0x0. >> >> This likely happens due to a _CPU_Context_switch, where the rsp is set >> to a corrupted value, leading to a corrupt (i.e. 0) return address at >> the end of the context switch. >> >> What's curious is that this corruption _seems_ to occur in >> _ISR_Handler's call to _Thread_Dispatch, by somehow messing the value >> of rsp up - I honestly don't know this for sure because gdb says one >> thing (i.e. that rsp = 0), but setting up some code (cmpq $0, rsp) to >> check this seems to say rsp is non-zero, at least. >> >> This is an odd heisenbug I'd like to investigate for sure - I just >> thought I'd shoot this email out because: >> >> - If I can't figure it out tomorrow, soon, I'll just drop it so I can >> create more logical commits to send as patches upstream (thereby >> leaving -O0 upstream, at least temporarily) >> >> - If anyone's seen an odd stack corruption like this, or has any >> advice on debugging it, could you let me know? I suspect something >> like interrupting tasks which ought not to be interrupted (perhaps I >> forgot to implement some kind of "CPU_ISR_Disable") - is there >> anything you can think of of that sort? >> >> Also, here's a Github PR like last time with all the work (just for >> the overall changes, not the specific commits!). I'd appreciate a >> quick review if anyone could - sorry about sending this out over the >> weekend! I've had a surprising share of Heisenbugs with QEMU in the >> past week. >> >> https://github.com/AmaanC/rtems-gsoc18/pull/3/files ___ devel mailing list devel@rtems.org http://lists.rtems.org/mailman/listinfo/devel