Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

Awesome! Good catch, Adrian! (And my pleasure!)

-Todd

> On Apr 25, 2016, at 4:13 PM, Adrian McCarthy wrote:
>
> amccarth added a comment.
>
> I got it. There was an exception being thrown from asyncore so early that it
> didn't give a proper stack trace. The problem boils down to Python 3 drawing
> a distinction between strings and bytearrays. Patch to come momentarily.
>
> Thanks for your help.
>
> http://reviews.llvm.org/D19214

___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits
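For readers unfamiliar with the str/bytes split Adrian mentions, here is a minimal sketch of the Python 3 pitfall. The values are illustrative, not the exact patch code, but they show why code that built a one-byte ack from chr() worked in Python 2 and fails in Python 3:

```python
# In Python 2, bytearray.append() accepted a single-character str,
# so building an ack byte from chr(42) worked. In Python 3, str and
# bytes are distinct types and append() requires an int.
ack = bytearray()
try:
    ack.append(chr(42))  # chr() returns a str in Python 3 -> TypeError
except TypeError:
    pass  # the kind of early failure that can hide a proper stack trace

# A Python 3-safe way to build the same one-byte ack:
ack = bytearray()
ack.append(42)           # append the int directly
assert bytes(ack) == b"*"
```

The same asymmetry applies when receiving: socket.recv() returns bytes in Python 3, so comparisons against str literals silently fail.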
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala added a subscriber: tfiala.
tfiala added a comment.

Awesome! Good catch, Adrian! (And my pleasure!)

-Todd
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala added a comment.

Ick. Yeah, that sounds like something to resolve quickly. They shouldn't be getting stuck. That would imply that the side running via dosep.py (the listener socket) is either (1) no longer running, and therefore the listener isn't spinning up (although that would be catastrophic and nothing else would work), or (2) the listening socket that fires up isn't getting time, or is for some other reason not getting its send out.

Can you dial down the threads (which in the Windows case means processes) further? How low are they set now?
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

amccarth added a comment.

We already limit threads on Windows to work around other problems. Technically, we force Windows to always use the multiprocessing pool instead of the threading pool, which I think has the effect of limiting the number of threads.

I suspect that the first several invocations are stuck waiting "forever" for the new ACK byte (perhaps because of buffering?), and that causes the rest of the invocations to fail. I'm still investigating.

In the week this has been broken, it seems we've already collected 30-40 newly broken tests on Windows. :-(
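One defensive measure against the "stuck waiting forever for the ACK byte" theory would be to put a timeout on the ack read so a worker fails fast with a clear error instead of hanging. This is a sketch, not part of the patch; the function name and timeout value are illustrative:

```python
import socket

def recv_ack_with_timeout(sock, timeout_seconds=30.0):
    """Wait for the one-byte listener ack, but fail fast with a clear
    error instead of blocking forever if the listener never comes up."""
    sock.settimeout(timeout_seconds)
    try:
        data = sock.recv(1)
    except socket.timeout:
        raise RuntimeError(
            "no ack from results listener within {} seconds".format(
                timeout_seconds))
    finally:
        # Restore blocking mode for normal event sends afterwards.
        sock.setblocking(True)
    if not data:
        raise RuntimeError("listener closed the connection before ack")
    return data
```

A hang then surfaces as a RuntimeError naming the listener, rather than a worker that silently never returns.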
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala added a comment.

My guess here would be that fixing the race (i.e. ensuring the listening side is up and running before allowing the sender to continue) is probably keeping more sockets open than before; in other words, it fixes the race, but does so by using a larger average number of resources. There are at least a couple of ways to solve that, if that is indeed the issue:

1. Don't run as many threads on Windows (set --threads 24 or 32).

2. Abstract the communication mechanism, and use something different on Windows (and/or something better for all) that allows bi-directional communication and isn't as limited. The "something different" could be a file-based approach or some other IPC mechanism. The key is that it needs to be compatible with the asyncore approach used as part of the main loop in dosep.py.

What you probably don't want to do is revert this on Windows, since it is likely ensuring that some test events come in that were getting dropped before.

If you end up excessively stuck on this, I'd suggest doing #1 to keep the limit from being hit, while a better solution along the lines of #2 is worked out for Windows. (And if you get stuck on #2, let me know. I may be able to dig around a bit and help, or at least get a test suite for that part up so you can plug in an implementation that works.)
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

amccarth added a subscriber: amccarth.
amccarth added a comment.

FYI: This patch seems to have broken all LLDB testing on Windows. I'll investigate why. Lots of the socket calls now result in stack traces like this:

Traceback (most recent call last):
  File "C:\python_35\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\python_35\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 495, in process_dir_worker_multiprocessing_pool
    return process_dir(*args)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 442, in process_dir
    command, timeout, base_name, inferior_pid_events, full_test_path))
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 417, in call_with_timeout
    test_filename)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 372, in send_inferior_post_run_events
    send_events_to_collector(post_events, command)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", line 308, in send_events_to_collector
    formatter_spec = result_formatter.create_results_formatter(config)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\result_formatter.py", line 112, in create_results_formatter
    results_file_object, cleanup_func = create_socket(config.port)
  File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\result_formatter.py", line 78, in create_socket
    sock.connect(("localhost", port))
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

The changes to result_formatter.py are after the problematic line, so the root cause is not immediately obvious. I have a guess that we're trying to open more sockets than before and are therefore exceeding a handle limit imposed by Python's use of ancient C RTL code.
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala added a comment.

If you're in a position where you can grab a sample (using the OS X sample tool) on the bot when they're "hanging", we can try to find out where they are. The reason I *think* they're not in a test method is that they don't have test method information associated with them. But that could be some kind of bug.

In any event, I'm working on this in a TDD fashion now.
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

labath added a comment.

In http://reviews.llvm.org/D19214#405204, @tfiala wrote:

> Okay, I just looked at those, Pavel.
>
> They fall into a category that I know is not yet handled. This is the category:
>
> - No test method is "open" (i.e. the dotest.py inferior is not in between a start_test/end_test) when the timeout occurs. Thus, we cannot "charge" the timeout to a test method. Thus, it gets charged to the file.

This is fine. I don't have an issue with the presentation of the results.

> - When we rerun, we currently do not clear file-level results. (We should, right before the rerun.)
>
> This hadn't been prioritized because I haven't in practice run into timeouts that are occurring outside of a test method. That implies either the startup or the shutdown of the python dotest.py inferior is hanging if you are consistently hitting that. That in itself is an interesting event to investigate.

I'm not sure that is the case. The tests that are reported as timed out are not doing anything special (testdefaultcachelinesize, testlistener) or are completely skipped on Android (testpublicapiheader), so I'd be very surprised if we were seeing the timeouts there and nowhere else... It sounds more like a test runner bug to me. In any case, I'll try to look into that once I finish dealing with the other things bugging us. I just thought you should know about it, as I think it could be related to this...
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala added a comment.

I filed this bug to track it:
https://llvm.org/bugs/show_bug.cgi?id=27423
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala added a comment.

Okay, I just looked at those, Pavel.

They fall into a category that I know is not yet handled. This is the category:

- No test method is "open" (i.e. the dotest.py inferior is not in between a start_test/end_test) when the timeout occurs. Thus, we cannot "charge" the timeout to a test method. Thus, it gets charged to the file.

- When we rerun, we currently do not clear file-level results. (We should, right before the rerun.)

This hadn't been prioritized because I haven't in practice run into timeouts that occur outside of a test method. That implies either the startup or the shutdown of the python dotest.py inferior is hanging if you are consistently hitting that. That in itself is an interesting event to investigate.

I'll file a bug on this, since there is a reasonable solution that will solve this one. I'll get a bug tracking that once I finish writing this.
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala added a comment.

Interesting, thanks, Pavel! I'll look at those as I start looking into testing the test infrastructure.
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

labath added a comment.

BTW, we are still seeing these fake timeouts on our buildbot (http://lab.llvm.org:8011/builders/lldb-x86_64-darwin-13.4/builds/9770), so I am not sure this actually fixed the problem. It could be that this is a different problem, but it does fit your description very well. In any case, I just thought I'd let you know about it...
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala closed this revision.
tfiala added a comment.

Closing with commit r266624.
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

labath accepted this revision.
labath added a comment.
This revision is now accepted and ready to land.

This sounds a lot like the problem that has been plaguing our Darwin buildbot (interestingly, I have never seen it happen on Linux). Thanks for tracking that down. I am not sure I understand the architecture of this system completely, but it sounds like a reasonable thing to do.
Re: [Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala added a comment.

The fix simply has the listener socket always send a byte, and the initiators of the socket always wait for that byte before continuing. This ensures that in both cases where a sender is creating a socket and sending test events, it will never exit the dosep.py main loop before the listener socket is added to the asyncore map. This fixes the race.

The two initiators of sockets are:

1. The dotest.py inferior process that is running a test case. This is the channel that normal test events (e.g. test start, test result, etc.) come over.

2. The dosep.py worker queues (either true threads via the threading module, or faux threads that are really processes, via the multiprocessing library). They will spin up a socket if they need to report that a test inferior has generated an exceptional exit (i.e. a seg fault or some other signal-generating exit), or if a timeout forced the process to be killed.

It is the second case above, during a rerun pass, that was most likely to race.
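The handshake described above can be sketched with a self-contained illustration. This is not the patch code (which lives in dotest_channels.py and result_formatter.py and runs over a real listening socket inside an asyncore loop); the function names are illustrative, and a local socket pair stands in for the TCP connection:

```python
import socket

def listener_side_on_accept(conn):
    """Listener side: immediately send a one-byte ack so the initiator
    knows this end is registered and running before any events flow."""
    conn.send(b"*")

def initiator_side_connect(conn):
    """Initiator side: block until the ack arrives before sending any
    test events, so no events can be lost to a not-yet-ready listener."""
    ack = conn.recv(1)
    assert ack == b"*"
    return conn

# Demonstrate the handshake with a local socket pair standing in for
# the dosep.py listener and a worker/inferior initiator.
listener_end, initiator_end = socket.socketpair()
listener_side_on_accept(listener_end)
initiator_side_connect(initiator_end)
initiator_end.send(b"test-event")
assert listener_end.recv(16) == b"test-event"
```

The key property is ordering: the initiator cannot finish (and the main loop cannot exit) until the listener has demonstrably come up, because the ack byte only exists once the listener end is live.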
[Lldb-commits] [PATCH] D19214: fix a race in the LLDB test suite results collection

tfiala created this revision.
tfiala added a reviewer: labath.
tfiala added a subscriber: lldb-commits.

The race boiled down to this:

If a test worker queue is able to run the test inferior and clean up before the dosep.py listener socket is spun up, and the worker queue is the last one (as would be the case when there's only one test rerunning in the rerun queue), then the test suite will exit the main loop before having a chance to process any test events coming from the test inferior or the worker queue job control.

I found this race to be far more likely on fast hardware. Our Linux CI is one such example. While it will show up primarily during meta test events generated by a worker thread when a test inferior times out or exits with an exceptional exit (e.g. a seg fault), it only requires that the OS take longer to hook up the listener socket than it takes for the final test inferior and worker thread to shut down.

http://reviews.llvm.org/D19214

Files:
  packages/Python/lldbsuite/test/dosep.py
  packages/Python/lldbsuite/test/dotest_channels.py
  packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py.park
  packages/Python/lldbsuite/test/result_formatter.py

Index: packages/Python/lldbsuite/test/result_formatter.py
===================================================================
--- packages/Python/lldbsuite/test/result_formatter.py
+++ packages/Python/lldbsuite/test/result_formatter.py
@@ -76,6 +76,18 @@
     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
     sock.connect(("localhost", port))
+
+    # Wait for the ack from the listener side.
+    # This is needed to prevent a race condition
+    # in the main dosep.py processing loop: we
+    # can't allow a worker queue thread to die
+    # that has outstanding messages to a listener
+    # socket before the listener socket asyncore
+    # listener socket gets spun up; otherwise,
+    # we lose the test result info.
+    read_bytes = sock.recv(1)
+    # print("\n** socket creation: received ack: {}".format(ord(read_bytes[0])), file=sys.stderr)
+
     return (sock, lambda: socket_closer(sock))
 
 default_formatter_name = None
Index: packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py.park
===================================================================
--- packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py.park
+++ packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py.park
@@ -3,19 +3,21 @@
 import time
 
-import lldbsuite.test.lldbtest as lldbtest
+import lldbsuite.test.decorators as decorators
 import rerun_base
 
 
 class RerunTimeoutTestCase(rerun_base.RerunBaseTestCase):
-    @lldbtest.no_debug_info_test
+    @decorators.no_debug_info_test
     def test_timeout_rerun_succeeds(self):
-        """Tests that timeout logic kicks in and is picked up."""
+        """Tests that the timeout logic kicks in and that this timeout is picked up."""
         if not self.should_generate_issue():
             # We pass this time.
             return
+        # We time out this time.
         while True:
+            # noinspection PyBroadException
             try:
                 time.sleep(1)
             except:
Index: packages/Python/lldbsuite/test/dotest_channels.py
===================================================================
--- packages/Python/lldbsuite/test/dotest_channels.py
+++ packages/Python/lldbsuite/test/dotest_channels.py
@@ -55,6 +55,14 @@
             # unpickled results.
             raise Exception("forwarding function must be set")
 
+        # Initiate all connections by sending an ack. This allows
+        # the initiators of the socket to await this to ensure
+        # that this end is up and running (and therefore already
+        # into the async map).
+        ack_bytes = bytearray()
+        ack_bytes.append(chr(42))
+        file_object.send(ack_bytes)
+
     def deserialize_payload(self):
         """Unpickles the collected input buffer bytes and forwards."""
         if len(self.ibuffer) > 0:
Index: packages/Python/lldbsuite/test/dosep.py
===================================================================
--- packages/Python/lldbsuite/test/dosep.py
+++ packages/Python/lldbsuite/test/dosep.py
@@ -109,13 +109,17 @@
     global GET_WORKER_INDEX
     GET_WORKER_INDEX = get_worker_index_use_pid
 
-def report_test_failure(name, command, output):
+def report_test_failure(name, command, output, timeout):
     global output_lock
     with output_lock:
         if not (RESULTS_FORMATTER and RESULTS_FORMATTER.is_using_terminal()):
             print(file=sys.stderr)
             print(output, file=sys.stderr)
-            print("[%s FAILED]" % name, file=sys.stderr)
+            if timeout:
+                timeout_str = " (TIMEOUT)"
+            else:
+                timeout_str = ""
+            print("[%s