Hi all,
I am very certain I have found an extremely nasty JIT compiler bug in the
current version of Chrome (51.0.2704.103 m for me at the time of
writing).
The problem is that I am having a really hard time finding a test case that
would be acceptable in a CRBug issue. For me the issue only occurs in about
1 in 30 page reloads. For some of my colleagues it happens more often, and
customers have reported it happening once every 5 to 10 reloads, but it only
ever happens on a large code base (about 2 to 10 megabytes of minified
JavaScript code). Stripping down the problem is extremely difficult because
it does not reproduce deterministically: if I reduce the code and the
problem goes away, I cannot be sure whether this is just bad luck (and I
need to reload 10 or 50 more times) or whether the problem really went away
and I need to bisect in the other direction. How do you guys approach
such a problem?
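To put a number on the "bad luck" case, here is a rough sketch of how many
clean reloads in a row I would need before trusting a reduction step. The
helper name is made up for this mail, and it assumes every reload fails
independently with the same probability, which may well not hold for a
JIT-dependent bug:

```javascript
// Hypothetical helper: number of consecutive clean reloads needed before
// "the bug is gone" can be believed with the given confidence.
// Assumes independent reloads with a fixed failure rate (an assumption).
function cleanReloadsNeeded(failureRate, confidence) {
  if (!(failureRate > 0 && failureRate < 1)) {
    throw new RangeError("failureRate must be between 0 and 1");
  }
  if (!(confidence > 0 && confidence < 1)) {
    throw new RangeError("confidence must be between 0 and 1");
  }
  // P(n clean reloads although the bug is still there) = (1 - p)^n.
  // Solve (1 - p)^n <= 1 - confidence for n.
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}

console.log(cleanReloadsNeeded(1 / 30, 0.95)); // 89
```

So at my 1-in-30 rate, each bisection step would need roughly 89 clean
reloads to be 95% sure, which is why doing this by hand is so painful.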
Here is what I found:
In our JavaScript library we offer the option to perform additional
runtime type checks for all public API members: for each API member we
dynamically create a wrapper function that uses separate metadata to
type-check the arguments of the function. The result is that for every
function we have defined there is another function that ultimately
delegates to the original function using "apply(this, arguments)", but only
after it has checked the arguments for validity. The metadata for each
check lives in the closure of the wrapper, so the source code is shared
between all wrapper functions, but each instance of the wrapper has a
different closure context that keeps references to the types and argument
counts.
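To make the setup concrete, here is a minimal sketch of the wrapper idea.
The names makeChecked and meta and the simple typeof-based check are made up
for this mail; it is a simplification, not our actual code:

```javascript
// Hypothetical sketch: one shared wrapper source, one closure context
// (original + meta) per wrapped API member.
function makeChecked(original, meta) {
  return function checkedWrapper() {
    // meta comes from this wrapper's own closure context.
    if (arguments.length !== meta.types.length) {
      throw new TypeError(meta.name + ": expected " + meta.types.length +
                          " argument(s), got " + arguments.length);
    }
    for (var i = 0; i < arguments.length; i++) {
      if (typeof arguments[i] !== meta.types[i]) {
        throw new TypeError(meta.name + ": argument " + i +
                            " should be of type " + meta.types[i]);
      }
    }
    // Only delegate once all arguments have passed the check.
    return original.apply(this, arguments);
  };
}

var add = makeChecked(function (a, b) { return a + b; },
                      { name: "add", types: ["number", "number"] });
var greet = makeChecked(function (who) { return "Hello " + who; },
                        { name: "greet", types: ["string"] });
```

Here add and greet run the same checkedWrapper source, and only the closure
context tells them apart, which is exactly the part that seems to go wrong.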
Now some of the type checks are more complex, and it can happen that during
a type check we descend recursively into some parts of our API, so the
same type-checking source code may be entered recursively (but with a
different context). The type-checking framework does in fact type-check
itself. This works perfectly about 95% of the time in Chrome, and it
works flawlessly in older versions of Chrome (48?) and in all other
browsers. However, with recent version updates we found that we were
sometimes getting stack overflows ("Maximum call stack size exceeded"),
and closer analysis showed that the following is happening:
Sometimes when we call a function (which is actually the type-check-wrapped
variant) and another type check is then performed inside the original
function, the closure context of the inner type check suddenly gets the
values from the outer frame. If those values are fine from the type check's
point of view, this of course results in endless recursion.
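One thing that might help while hunting this is turning the runaway
recursion into an ordinary exception before the stack is exhausted. A
rough sketch, where withDepthGuard and the limit of 200 are arbitrary
choices for illustration and not part of our framework:

```javascript
// Hypothetical depth guard: fail early with a normal Error instead of
// running into "Maximum call stack size exceeded".
var MAX_CHECK_DEPTH = 200; // arbitrary limit, chosen for illustration
var checkDepth = 0;

function withDepthGuard(fn, name) {
  return function depthGuarded() {
    if (checkDepth >= MAX_CHECK_DEPTH) {
      throw new Error("type check recursion too deep in " + name);
    }
    checkDepth++;
    try {
      return fn.apply(this, arguments);
    } finally {
      // Always unwind the counter, also when an exception passes through.
      checkDepth--;
    }
  };
}
```

The Error thrown here is a regular exception, so the debugger can stop on it
while the offending frames are still on the stack.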
Stack overflows aren't caught by "break on exception", so inspecting the
stack at that point was not possible for us. However, we found a second
exception that happened directly *after* the stack overflow, when more code
was executed. This time we were able to "break on exception" and inspect the
stack frames, and this is what we found:
In the stack we can see that one method call results in a type check, and
inside the type check another (different) type-checked method is called.
However, if we look at "arguments.callee" and inspect the closure
variables in the inner type check, we get the values from the ancestor
stack frames: even though one method is called, inside that very same
method in the next stack frame "arguments.callee" is different from the
reference at the call site. This should actually be an invariant, if I'm not
very much mistaken: when I call a method, "arguments.callee" inside that
method should be referentially the same entity that I am calling, but this
is not the case in our test case when the second exception happens. I only
get an exception because my type check complains that it is being invoked
with the wrong arguments (because the type metadata is wrong and the check
is performed in a different method). If I didn't get that exception, the
code would happily continue to execute, but with the wrong
data/closure context. This is what worries me: it's very hard to debug
because it only happens sometimes after a reload (probably when the JIT
kicks in for the first time or something like that), and often it will not
result in an exception; you will "just" get data corruption.
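To catch the silent case, the invariant could be checked explicitly. The
sketch below records the callee at the call site and compares it with the
reference the wrapper finds in its own closure. The names makeVerifiable
and callVerified are invented for this mail, and I have not confirmed that
such a check would actually trip on the miscompiled code:

```javascript
// Hypothetical invariant check: the function reached must be the function
// the call site invoked.
var pendingCallee = null; // set by the call site right before the call

function makeVerifiable(original, name) {
  var self = function verifiableWrapper() {
    var expected = pendingCallee;
    pendingCallee = null;
    // self is read from this wrapper's closure context. If it differs
    // from what the call site invoked, the context is the wrong one.
    if (expected !== null && expected !== self) {
      throw new Error("closure context mismatch in " + name);
    }
    return original.apply(this, arguments);
  };
  return self;
}

function callVerified(fn, thisArg, args) {
  pendingCallee = fn;
  return fn.apply(thisArg, args);
}

var double = makeVerifiable(function (x) { return 2 * x; }, "double");
var triple = makeVerifiable(function (x) { return 3 * x; }, "triple");
console.log(callVerified(double, null, [21])); // 42
```

In correct code the check can never fail, so any "closure context mismatch"
would point at the engine rather than at the library.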
That's why I think this is a really severe bug that we should try to
fix as soon as possible: it could cause thousands of hours of
JavaScript debugging to be wasted on a bug that just isn't in the
JavaScript, but in the JIT (I believe). And that's why I would like to
report the bug, but my current test case is several megabytes in size,
executes for one or two seconds before it fails, and only fails once every
few dozen runs. Do you have an idea how to debug this? How could I trim
down the test case? Would a heap dump help here? I took a heap dump once
I was in that bad state to inspect the closure/context values of the
functions in question, but I was not able to view the optimized code with
the Chrome developer tools.
Any help or advice you might have would be greatly appreciated. I would
hate to know that there is a bug like this sitting in Chrome that can
break applications or corrupt data almost randomly, with no way to avoid
or fix it.
Thanks - Sebastian
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev