Hi all,

I am very certain I have found an extremely nasty JIT compiler bug in the
current version of Chrome (51.0.2704.103 m for me at the time of
writing).

The problem is that I am having a really hard time finding a test case that
would be acceptable in a CRBug issue. For me the issue only happens in about
1 of 30 page reloads. For some of my colleagues it happens more often, and
customers have reported it happening once every 5 to 10 reloads. It only ever
happens on a large code base (about 2 to 10 megabytes of minified
JavaScript code), and stripping down the problem is extremely difficult
because it does not reproduce deterministically: if I reduce the code and
the problem goes away, I cannot be sure whether this is just bad luck (and
I need to reload 10 or 50 more times) or whether the problem really went
away and I need to bisect in the other direction. How do you approach
such a problem?
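
To put a rough number on the "bad luck" case: if each reload is assumed to
fail independently with a fixed probability, the number of clean reloads
needed before a reduction step can be trusted follows from (1 - p)^n. This is
only a sketch under that independence assumption; the function name and the
95% threshold are illustrative choices.

```javascript
// How many consecutive clean reloads are needed so that, if the bug were
// still present with per-reload failure probability p, it would have shown
// up at least once with the given confidence. Assumes independent reloads.
function reloadsNeeded(p, confidence) {
  // Solve (1 - p)^n <= 1 - confidence for the smallest integer n.
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - p));
}

console.log(reloadsNeeded(1 / 30, 0.95)); // 89
console.log(reloadsNeeded(1 / 5, 0.95));  // 14
```

So at a 1-in-30 failure rate, "10 or 50 more reloads" is not enough to rule
the bug out with 95% confidence; closer to 90 clean reloads per bisection
step would be needed.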

Here is what I found:
In our JavaScript library we offer the possibility to perform additional
runtime type checks for all the public API members: for each API member we
dynamically create a wrapper function that accesses separate metadata, which
it uses to perform type checks on the arguments of the function. The result
is that for every function we have defined, there is another function
which ultimately delegates to the original function using "call(this,
arguments)", but only after it has checked the arguments for validity. The
metadata for each check is found in the closure of the function, so the
source code is shared between all wrapper functions, but each instance of
the function has a different closure context that keeps references to the
types and argument counts.
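
A minimal sketch of this wrapping scheme, to make the setup concrete. It is
not our real code: the name wrapWithTypeCheck, the layout of the meta object
and the typeof-based checks are all hypothetical, and forwarding with apply
is an assumption about how the delegation is spelled.

```javascript
// Hypothetical sketch: one shared wrapper body, one closure context per
// wrapped API member. `meta` holds the expected types and argument count.
function wrapWithTypeCheck(original, meta) {
  return function checkedWrapper() {
    if (arguments.length !== meta.types.length) {
      throw new TypeError(meta.name + ": expected " + meta.types.length +
                          " arguments, got " + arguments.length);
    }
    for (var i = 0; i < arguments.length; i++) {
      if (typeof arguments[i] !== meta.types[i]) {
        throw new TypeError(meta.name + ": argument " + i +
                            " should be of type " + meta.types[i]);
      }
    }
    // Delegate to the original with the same receiver and arguments.
    return original.apply(this, arguments);
  };
}

// Two wrappers that share the source above but have different contexts.
var add = wrapWithTypeCheck(function (a, b) { return a + b; },
                            { name: "add", types: ["number", "number"] });
var greet = wrapWithTypeCheck(function (s) { return "Hi " + s; },
                              { name: "greet", types: ["string"] });

console.log(add(1, 2));    // 3
console.log(greet("all")); // Hi all
```

If the context of greet were ever swapped for the context of add, greet would
check its arguments against add's metadata, which is the kind of mix-up
described below.
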
Now some of the type checks are more complex, and it can happen that during
a type check we descend recursively into some parts of our API, so the
same type-checking source code may be entered recursively (but with a
different context). The type-checking framework does in fact
type-check itself. This works perfectly about 95% of the time in Chrome, and
it works flawlessly in older versions of Chrome (48?) and
all other browsers. However, with recent version updates we found that
we were sometimes getting stack overflows ("Maximum call stack size
exceeded"), and closer analysis showed that the following is happening:
sometimes, when we call a function (which is actually the type-check-wrapped
variant) and another type check is performed inside the original function,
the closure context of the inner type check suddenly gets the
values from the outer frame. If those values are fine from the type check's
point of view, this of course results in endless recursion.

Stack overflows aren't caught by "break on exception", so inspecting the
stack was not possible for us. However, we found a second exception that
happened directly *after* the stack overflow, when more code was executed,
and this time we were able to "break on exception" and inspect the stack
frames. This is what we found:

In the stack we can see that one method call results in a type check, and
inside the type check another (different) type-checked method is called.
However, if we look at "arguments.callee" and inspect the closure
variables in the inner type check, we get the values from the ancestor
stack frames. So even though one method is called, inside that very same
method in the next stack frame "arguments.callee" is different from the
reference at the call site. This should actually be an invariant, if I'm not
very much mistaken: when I call a method, inside that method
"arguments.callee" should be referentially the same entity that I am calling,
but this is not the case in my test when the second exception happens. I only
get an exception because my type check complains that it is being invoked
with the wrong arguments (because the type metadata is wrong and the check
is performed in a different method). If I didn't get that exception, the
code would happily continue to execute, but working with the wrong
data/closure context. This is what worries me: it's very hard to debug
because it only happens sometimes after a reload (probably when the JIT kicks
in for the first time or something like that), and often it will not
result in an exception but you will "just" get data corruption.

That's why I think this is a really severe bug that we should try to
fix as soon as possible: it could cause thousands of hours of
JavaScript debugging to be wasted on a bug that isn't in the
JavaScript but in the JIT (I believe). And that's why I would like to
report it, but my current test case is several megabytes in size and
executes for one or two seconds before it fails, and this only happens every
few dozen times. Do you have an idea how to debug this? How could I trim
down the test case? Would a heap dump help here? I took a heap dump once
I was in that bad state to inspect the closure/context values of the
functions in question, but I was not able to view the optimized code with
the Chrome developer tools.
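
One diagnostic I could imagine, since the stack overflow itself cannot be
paused on: count how deeply the shared type-check code is nested and throw an
ordinary Error well before the engine's stack limit. The sketch below is
hypothetical (the names guarded, MAX_CHECK_DEPTH and recentChecks and the
threshold of 200 are made up for illustration), and the runaway function only
simulates the endless recursion; it does not reproduce the actual bug.

```javascript
// Hypothetical diagnostic: turn runaway type-check recursion into a normal
// Error that carries the names of the innermost active wrappers.
var MAX_CHECK_DEPTH = 200; // assumed threshold, tune for the code base
var checkDepth = 0;
var activeChecks = [];     // names of the wrappers currently on the stack

function guarded(name, fn) {
  return function guardedWrapper() {
    activeChecks.push(name);
    checkDepth++;
    try {
      if (checkDepth > MAX_CHECK_DEPTH) {
        var err = new Error("type-check recursion too deep in " + name);
        // Keep only the innermost names so the report stays readable.
        err.recentChecks = activeChecks.slice(-10);
        throw err;
      }
      return fn.apply(this, arguments);
    } finally {
      // Always unwind the bookkeeping, also when an exception passes through.
      checkDepth--;
      activeChecks.pop();
    }
  };
}

// Simulated endless recursion, standing in for the corrupted-context case.
var runaway = guarded("runaway", function (n) { return runaway(n + 1); });

var caught = null;
try {
  runaway(0);
} catch (e) {
  caught = e;
}
console.log(caught.message);             // type-check recursion too deep in runaway
console.log(caught.recentChecks.length); // 10
```

Because the Error is thrown while the offending frames are still on the
stack, it should be possible to stop there in the debugger and look at the
closure variables of each frame.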

Any help or advice you might have would be greatly appreciated. I would
hate to know that a bug like this is sitting in Chrome, able to break
applications or corrupt data almost randomly, with no way to avoid it or
fix it.

Thanks - Sebastian
