I just finished fixing http://crbug.com/75604 aka https://bugs.webkit.org/show_bug.cgi?id=71122 , and it was a long process, and pretty involved, and had a lot of cycles of change, wait for crash data, etc... So here's my postmortum:
TL;DR No matter how sure you are you fixed a crashing bug, keep reading crash reports to be sure. TL;DR2 Practice release crash dump debugging regularly. Read dissassembled code and never trust symbols. TL;DR3 When in doubt, add more state to enhance crash dumps. TL;DR4 WebKit can save stacks in memory now. gav...@webkit.org can help you use this. WHAT WAS HAPPENING The ScriptRunner was crashing, and we were getting a lot of uploads. We couldn't reproduce it by going to the same URIs, but it was definitely happening a lot in the wild. It didn't seem like other WebKit ports were seeing this bug either, which was interesting. WHAT WE DID 1. The crash in ScriptRunner was happening in ScriptRunner::timerFired(), and it looked like a CachedScript in the list of runnable scripts was NULL. How did that happen? So I added a CRASH() to ScriptRunner::queueScriptForExecution(). This meant we would crash before setting up the Async event, and maybe get a more interesting stack on our crash dumps? https://bugs.webkit.org/show_bug.cgi?id=65563 2. From this, we learned a few things. Crash dumps told us we definitely had a broken ScriptElement, and every stack showed a network error was in progress. The CachedScript in the ScriptElement was bad, and we couldn't figure out why. So, we added a hidden state machine to ScriptElement to track down how it was being zeroed. Was it double notification, or was the load continuing after being stopped? Crash dumps would tell. https://bugs.webkit.org/show_bug.cgi?id=66939 3. It was double notification. Nobody could figure out how on earth double notification happened. Boy, I'd sure like to know the stack at the time of the first ScriptElement::notifyFinished() call... So we added a new stack saving facility to WebKit, and used it here to save a stack inside the ScriptElement on every call to ScriptElement::notifyFinished(). Mosquito, meet our sledgehammer. https://bugs.webkit.org/show_bug.cgi?id=69453 4. Oh, so the earlier stack was data received, but with a response code >= 400? How interesting... Finally I could reproduce the bug, and introduce a fix. https://bugs.webkit.org/show_bug.cgi?id=71122 WHAT WORKED 1. Windows minidumps. In Chrome, our dev and canary channels have enabled a very useful option that gives you some heap. In walking the stack on Windows when crashing, any stack element interpretable as a pointer gets the 1024byte range [p-256,p+768] saved into the dump. This feature makes debugging from dev or canary windows minidumps superior to anything else. Learn how to use windb(a)g. 2. Progressive instrumentation. As usual, once I had a reproduction a fix was pretty easy. But, while I didn't have a reproduction, it was useful to add release assertions and watch crash dumps, to progressively zone in on this bug. 3. Adding new facilities to WebKit. You want stacks saved in memory on every platform? OK. https://bugs.webkit.org/show_bug.cgi?id=69018 WHAT DIDN'T WORK 1. Debuggers. Hackers who should have known better looked at release minidumps and symbols, and couldn't understand the variables. For a good example of this mistake from me, see http://code.google.com/p/chromium/issues/detail?id=75604#c21 . Want to see someone even smarter make the same mistake? http://code.google.com/p/chromium/issues/detail?id=75604#c43 The ScriptElement wasn't bogus, but the symbols data was. The symbols indicated ECX had this, when this was actually in EAX at the time, or [ESP]. Don't trust symbols in release builds; instead dissassemble the function, and figure out where the data is. The symbols probably said ECX because at the point of first dereference or maybe at call time, in that scope, this was probably in ECX. But in optimised code, that doesn't have much to do with anything anywhere else in the code. Read the dissassembly. 2. Reading code. I spent far, far too much time doing this. You know what, usually this works for me: I read code, consider the stack, have an AHA! moment, and fix my bug. Too bad this bug was beyond my cognitive horizon, and reading code was just a great way to waste time. 3. Not watching crashes. See comments 30 and 31 in the bug: http://code.google.com/p/chromium/issues/detail?id=75604#c30 . We should not have had confidence the problem was solved; since, hey, it wasn't. SPECIAL THANKS TO a...@webkit.org for encouraging me to instrument where it helps. cbent...@chromium.org for putting up with my yak shaving. ero...@chromium.org for showing me how to use windb(a)g. jap...@chromium.org for never giving up hope. _______________________________________________ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev