Re: [GSoC - x86_64] Pre-merge issues (at -O2 optimization level) and WIP review

2018-08-12 Thread Amaan Cheval
Hi!

I've narrowed the issue down to this bintime function:
https://github.com/RTEMS/rtems/blob/b2de4260c5c71e518742731a8cdebe3411937181/cpukit/score/src/kern_tc.c#L548

The watchdog ticks in _Per_CPU_Information / Clock_driver_ticks are at
"1000", when that function is called (rtems_clock_get_tod ->
_TOD_Get_timeval -> _Timecounter_Microtime -> microtime). The bt and
tvp values there are:

(gdb) p bt
$2 = {sec = 599562004, frac = 18446744073709551536}
(gdb) p *tvp
$3 = {tv_sec = 599562004, tv_usec = 99}
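
For context (this is just my reading of the FreeBSD timekeeping code,
so treat it as a sketch): a bintime represents sec plus frac/2^64
seconds, so frac = 18446744073709551536 = 2^64 - 80 means the time is
only 80/2^64 s short of the next full second. The conversion to a
timeval then truncates, roughly like FreeBSD's bintime2timeval():

    struct timeval tv;
    tv.tv_sec  = bt.sec;  /* stays at 599562004, not 599562005 */
    tv.tv_usec = ((uint64_t)1000000 * (uint32_t)(bt.frac >> 32)) >> 32;
    /* with frac = 2^64 - 80 this evaluates to 999999 us, i.e. just
       shy of the next second, so the reported time ends up one second
       lower than expected; one more tick and sec rolls over */

In other words, a counter value that lands exactly "on the second"
gets reported as the tail end of the previous second.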

The full (relevant) debug log for the "wrong" timing despite the
Clock_driver_ticks being correct is here:
https://gist.github.com/AmaanC/c59caf5232b03054d457dcacb5ab1c54

I'm quite unfamiliar with how the low-level internals work, and it
looks like this code comes from FreeBSD. The bug is likely down to the
timecounter being "too" precise - it returns _exactly_ the
tc_frequency it promised once a second has passed - and if it slips by
1 tick, the values start looking correct.

In that case this looks more like an off-by-one in the low-level code,
since my clock driver's timecounter returns exactly the value it ought
to return (e.g. 100 once 1 second has passed with tc_frequency = 100
Hz), yet the "now.tv_sec" value returned via bintime in clockgettod.c
causes the wrong second to be set in "time_buffer".

https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c#L51
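
For reference, the timecounter registration in the driver boils down
to something like the sketch below (simplified, with placeholder names
like amd64_get_timecount - not the exact code in clock.c):

    #include <stdint.h>
    #include <rtems/timecounter.h>

    /* Provided by the shared clock driver shell; incremented by the
       clock interrupt handler. */
    extern volatile uint32_t Clock_driver_ticks;

    static struct timecounter amd64_tc;

    static uint32_t amd64_get_timecount(struct timecounter *tc)
    {
      return Clock_driver_ticks;
    }

    static void amd64_install_timecounter(void)
    {
      amd64_tc.tc_get_timecount = amd64_get_timecount;
      amd64_tc.tc_counter_mask  = 0xffffffff;
      amd64_tc.tc_frequency     = 100;  /* promised ticks per second */
      amd64_tc.tc_quality       = RTEMS_TIMECOUNTER_QUALITY_CLOCK_DRIVER;
      rtems_timecounter_install(&amd64_tc);
    }

so after exactly 1 second has passed, get_timecount returns exactly
tc_frequency (100), which is the boundary case described above.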

Re: [GSoC - x86_64] Pre-merge issues (at -O2 optimization level) and WIP review

2018-08-12 Thread Amaan Cheval
There's another issue I'm having now:

At -O0, ticker.exe works well and passes reliably. At -O2, the TOD
seems to run ahead at times:

TA1  - rtems_clock_get_tod - 09:00:00   12/31/1988
TA2  - rtems_clock_get_tod - 09:00:00   12/31/1988
TA3  - rtems_clock_get_tod - 09:00:00   12/31/1988
TA1  - rtems_clock_get_tod - 09:00:04   12/31/1988
TA2  - rtems_clock_get_tod - 09:00:09   12/31/1988
TA1  - rtems_clock_get_tod - 09:00:09   12/31/1988
TA3  - rtems_clock_get_tod - 09:00:14   12/31/1988
TA1  - rtems_clock_get_tod - 09:00:14   12/31/1988
TA2  - rtems_clock_get_tod - 09:00:19   12/31/1988
TA1  - rtems_clock_get_tod - 09:00:19   12/31/1988
TA1  - rtems_clock_get_tod - 09:00:24   12/31/1988
TA3  - rtems_clock_get_tod - 09:00:29   12/31/1988
TA2  - rtems_clock_get_tod - 09:00:29   12/31/1988
TA1  - rtems_clock_get_tod - 09:00:29   12/31/1988
TA1  - rtems_clock_get_tod - 09:00:34   12/31/1988

I'm not sure what it could be - I suspected that my get_timecount was
somehow not treating Clock_driver_ticks as volatile (reading a stale,
cached value), but that seems to be in order. The relevant code is here:
https://github.com/AmaanC/rtems-gsoc18/blob/ac/daily-03-post-hello/bsps/x86_64/amd64/clock/clock.c
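
If it were a volatility problem, one way to rule it out (just a
defensive sketch, not what the driver currently does - same
placeholder name as above) would be to read through a
volatile-qualified pointer, which forces a fresh load regardless of
how the variable itself is declared:

    static uint32_t amd64_get_timecount(struct timecounter *tc)
    {
      return *(volatile uint32_t *)&Clock_driver_ticks;
    }

and checking the -O2 disassembly (objdump -d on the object file) for
an actual memory load in get_timecount would confirm the same thing.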

On Sun, Aug 12, 2018 at 3:43 AM, Amaan Cheval  wrote:
> Figured it out; it turns out the code I added to align the stack (so
> I could make calls without raising exceptions) was corrupting the
> stack pointer.
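>
> (For anyone hitting the same thing: the SysV AMD64 ABI wants rsp
> 16-byte aligned at the point of a call, and the usual fix is to keep
> a copy of the original stack pointer and restore it after the call
> instead of losing it - a rough sketch, not the exact code in my
> branch, with some_c_handler as a placeholder:
>
>     movq %rsp, %rbp      # save the original stack pointer (rbp is
>                          # assumed free or saved earlier here)
>     andq $-16, %rsp      # align downward to a 16-byte boundary
>     call some_c_handler  # placeholder for the C function being called
>     movq %rbp, %rsp      # restore the original, unaligned rsp
>
> i.e. the original rsp has to survive the alignment so it can be
> restored afterwards.)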
>
> Running the -O2 code now makes the clock run a bit too quickly - the
> calibration may have a minor issue. I'll fix that up and hopefully
> send patches tomorrow or Monday.
>
> I'll be traveling on Tuesday, so I'd appreciate it if we can get them
> merged upstream on Monday itself - I'm happy to have a call and walk
> someone through the patches if need be.
>
> Cheers!
>
> On Sun, Aug 12, 2018 at 1:25 AM, Amaan Cheval  wrote:
>> Hi!
>>
>> In the process of cleaning my work up, I've run into an odd problem
>> which only shows up when I set the optimization level to -O2. At -O0,
>> it's perfectly fine.
>>
>> The issue is that somehow execution ends up at address 0x0.
>>
>> This likely happens during _CPU_Context_switch, where rsp is set to
>> a corrupted value, leading to a corrupt (i.e. 0) return address being
>> popped at the end of the context switch.
>>
>> What's curious is that this corruption _seems_ to occur in
>> _ISR_Handler's call to _Thread_Dispatch, which somehow clobbers rsp -
>> I honestly don't know this for sure, because gdb says one thing (that
>> rsp = 0) while an inline check I added (cmpq $0, %rsp) seems to say
>> rsp is non-zero, at least.
>>
>> This is an odd heisenbug I'd like to investigate for sure - I just
>> thought I'd shoot this email out because:
>>
>> - If I can't figure it out by tomorrow, I'll just drop it so I can
>> create more logical commits to send as patches upstream (thereby
>> leaving -O0 upstream, at least temporarily)
>>
>> - If anyone's seen an odd stack corruption like this, or has any
>> advice on debugging it, could you let me know? I suspect something
>> like tasks being interrupted when they ought not to be (perhaps I
>> forgot to implement some kind of "CPU_ISR_Disable") - is there
>> anything of that sort you can think of?
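>>
>> (In case it's useful context: an interrupt disable/enable pair in
>> the spirit of the i386 port would be roughly the following - a
>> sketch only, untested, assuming _level is a 64-bit integer:
>>
>>     #define _CPU_ISR_Disable( _level ) \
>>       __asm__ volatile ( "pushfq; cli; popq %0" \
>>                          : "=rm" (_level) : : "memory" )
>>     #define _CPU_ISR_Enable( _level ) \
>>       __asm__ volatile ( "pushq %0; popfq" \
>>                          : : "rm" (_level) : "memory", "cc" )
>>
>> i.e. save RFLAGS, clear IF with cli, and restore RFLAGS on enable,
>> so nested sections put the interrupt flag back the way they found
>> it.)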
>>
>> Also, here's a GitHub PR, like last time, with all the work (just
>> for the overall changes, not the specific commits!). I'd appreciate
>> a quick review if anyone can spare the time - sorry about sending
>> this out over the weekend! I've had a surprising share of heisenbugs
>> with QEMU in the past week.
>>
>> https://github.com/AmaanC/rtems-gsoc18/pull/3/files