Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-25 Thread Todd Fiala via lldb-commits
Awesome! Good catch, Adrian!

(And my pleasure!)

-Todd

> On Apr 25, 2016, at 4:13 PM, Adrian McCarthy  wrote:
> 
> amccarth added a comment.
> 
> I got it.  There was an exception being thrown from asyncore so early that it 
> didn't give a proper stack trace.  The problem boils down to Python 3 drawing 
> a distinction between strings and bytearrays.  Patch to come momentarily.
> 
> Thanks for your help.
> 
> 
> http://reviews.llvm.org/D19214
> 
> 
> 
___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-25 Thread Todd Fiala via lldb-commits
tfiala added a subscriber: tfiala.
tfiala added a comment.

Awesome! Good catch, Adrian!

(And my pleasure!)

-Todd


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-25 Thread Todd Fiala via lldb-commits
tfiala added a comment.

Ick.  Yeah that sounds like something to resolve quickly.

They shouldn't be getting stuck.  That would imply that the side running via 
dosep.py (the listener socket) is either (1) no longer running and therefore 
the listener isn't spinning up, although that would be catastrophic and nothing 
else would work, or (2) the listening socket that fires up isn't getting time 
or is for some other reason not getting its send out.

Can you dial down the threads (which in the Windows case then means processes) 
further?  How low are they set to now?


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-25 Thread Adrian McCarthy via lldb-commits
amccarth added a comment.

We already limit threads on Windows to work around other problems.  
Technically, we force Windows to always use the multiprocessing-pool instead of 
the threading-pool, which I think has the effect of limiting the number of 
threads.

I suspect that the first several invocations are stuck waiting "forever" for 
the new ACK byte (perhaps because of buffering?) and that causes the rest of 
the invocations to fail.  I'm still investigating.

Having been broken for a week, it seems we've already collected 30-40 newly 
broken tests on Windows. :-(


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-25 Thread Todd Fiala via lldb-commits
tfiala added a comment.

My guess here would be that the fixing of the race - i.e. ensuring the 
listening side is up and running before allowing the sender to continue, is 
probably keeping more sockets open than before (i.e. fixing the race, but doing 
so by using an average larger number of resources).

There are at least a couple ways to solve that, if that indeed is the issue:

1. Don't run as many threads on Windows (set --threads 24 or 32),

2. We abstract the communication mechanism, and use something different on 
Windows and/or something better for all, that allows bi-directional 
communication and isn't as limited.  The "something different" could be a file 
based approach or some kind of other IPC mechanism.  The key is it needs to be 
compatible with the asyncore approach used as part of the main loop in dosep.py.

What you probably don't want to do is revert this on Windows, since it is 
likely ensuring some test events come in that were getting dropped before.

If you end up getting excessively stuck on this, I'd suggest doing #1 to 
eliminate the limit from being hit, while a better solution for #2 is worked 
out for Windows.  (And if you get stuck on #2, let me know, I may be able to 
dig around a bit and help, or at least get a test suite for that part up so you 
can plug in an implementation that works).


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-25 Thread Adrian McCarthy via lldb-commits
amccarth added a subscriber: amccarth.
amccarth added a comment.

FYI:  This patch seems to have broken all LLDB testing on Windows.  I'll 
investigate why.

Lots of the socket calls now result in stack traces like this:

  Traceback (most recent call last):
File "C:\python_35\lib\multiprocessing\pool.py", line 119, in worker
  result = (True, func(*args, **kwds))
File "C:\python_35\lib\multiprocessing\pool.py", line 44, in mapstar
  return list(map(*args))
File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", 
line 495, in process_dir_worker_multiprocessing_pool
  return process_dir(*args)
File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", 
line 442, in process_dir
  command, timeout, base_name, inferior_pid_events, full_test_path))
File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", 
line 417, in call_with_timeout
  test_filename)
File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", 
line 372, in send_inferior_post_run_events
  send_events_to_collector(post_events, command)
File "D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\dosep.py", 
line 308, in send_events_to_collector
  formatter_spec = result_formatter.create_results_formatter(config)
File 
"D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\result_formatter.py",
 line 112, in create_results_formatter
  results_file_object, cleanup_func = create_socket(config.port)
File 
"D:\src\llvm\llvm\tools\lldb\packages\Python\lldbsuite\test\result_formatter.py",
 line 78, in create_socket
  sock.connect(("localhost", port))
  ConnectionRefusedError: [WinError 10061] No connection could be made because 
the target machine actively refused it

The changes to result_formatter.py are after the problematic line, so the root 
cause it not immediately obvious.  I have a guess that we're trying to open 
more sockets than before and are therefore exceeding a handle limit imposed by 
Python's use of ancient C RTL code.


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-19 Thread Todd Fiala via lldb-commits
tfiala added a comment.

If you're in a position where you can grab a sample (using the OS X sample 
tool) on the bot when they're "hanging", we can try to find out where they are 
at.

The reason I *think* they're not in a method is because they don't have test 
method information associated with them.  But that could be some kind of bug.

In any event, I'm working on this in a TDD fashion now.


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-19 Thread Pavel Labath via lldb-commits
labath added a comment.

In http://reviews.llvm.org/D19214#405204, @tfiala wrote:

> Okay I just looked at those, Pavel.
>
> They fall into a category that I know is not yet handled.  This is the 
> category:
>
> - No test method is "open" (i.e. the dotest.py inferior is not in between a 
> start_test/end_test) when the timeout occurs.  Thus, we cannot "charge" the 
> timeout to a test method.  Thus, it gets charged to the file.


This is fine. I dont have an issue with the presentation of the results.

> 

> 

> - When we rerun, we currently do not clear file-level results.  (We should 
> right before the rerun).

> 

>   This hadn't been prioritized because I haven't in practice run into 
> timeouts that are occurring outside of a test method.  That implies either 
> the startup or the shutdown of the python dotest.py inferior is hanging if 
> you are consistently hitting that.  That in itself is an interesting event to 
> investigate.


I'm not sure that is the case. The test that are reported as timed out are not 
doing anything special (testdefaultcachelinesize, testlistener) or are 
completely skipped on android (testpublicapiheader) so i'd be very surprised if 
we were seeing the timeouts there and nowhere else... It sounds more like a 
test runner bug to me...

In any case, i'll try to look into that once i finish dealing with other things 
bugging us. I just thought you should know about, as i think it could be 
related to this...


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-19 Thread Todd Fiala via lldb-commits
tfiala added a comment.

I filed this bug to track it:
https://llvm.org/bugs/show_bug.cgi?id=27423


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-19 Thread Todd Fiala via lldb-commits
tfiala added a comment.

Okay I just looked at those, Pavel.

They fall into a category that I know is not yet handled.  This is the category:

- No test method is "open" (i.e. the dotest.py inferior is not in between a 
start_test/end_test) when the timeout occurs.  Thus, we cannot "charge" the 
timeout to a test method.  Thus, it gets charged to the file.

- When we rerun, we currently do not clear file-level results.  (We should 
right before the rerun).

This hadn't been prioritized because I haven't in practice run into timeouts 
that are occurring outside of a test method.  That implies either the startup 
or the shutdown of the python dotest.py inferior is hanging if you are 
consistently hitting that.  That in itself is an interesting event to 
investigate.

I'll file a bug on this since there is a reasonable solution that will solve 
this one. I'll get a bug tracking that once I finish writing this.


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-19 Thread Todd Fiala via lldb-commits
tfiala added a comment.

Interesting, thanks Pavel!

I'll look at those as I start looking to test the test infrastructure.


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-19 Thread Pavel Labath via lldb-commits
labath added a comment.

BTW, we are still seeing these fake timeouts on our buildbot 
http://lab.llvm.org:8011/builders/lldb-x86_64-darwin-13.4/builds/9770, so I am 
not sure if this actually fixed the problem.   It could be that this is 
actually a different problem, but it does fit your description very well. In 
any case, I just thought I'd let you know about that...


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-18 Thread Todd Fiala via lldb-commits
tfiala closed this revision.
tfiala added a comment.

Closing with commit:
r266624


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-18 Thread Pavel Labath via lldb-commits
labath accepted this revision.
labath added a comment.
This revision is now accepted and ready to land.

This sounds a lot like the problem that has been plaguing our darwin buildbot 
(interestingly, I have never seen it happen on linux). Thanks for tracking that 
down.

I am not sure I understand the architecture of this system completely, but it 
sounds like a reasonable thing to do.


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


Re: [Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-17 Thread Todd Fiala via lldb-commits
tfiala added a comment.

The fix simply has the listener socket always send a byte, and the initiators 
of the socket always wait for that byte before continuing.  This ensures that 
both cases where the sender is creating a socket and sending test events will 
never exit the dosep.py main loop before the listener socket is added to the 
asyncore map.  This fixes the race.

The two initiators of sockets are:

1. the dotest.py inferior process that is running a test case.  This is the 
channel that normal test events (e.g. test start, test result, etc.) come over.

2. the dosep.py worker queues (either true threads via the threading module, or 
faux threads that are really processes, via the multiprocessing library).  They 
will spin up a socket if they need to report that a test inferior has generated 
an exceptional exit (i.e. seg fault or some other signal-generating exit) or if 
a timeout forced the process to be killed.

It is the second case above, during a rerun pass, that was most likely to race.


http://reviews.llvm.org/D19214



___
lldb-commits mailing list
lldb-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits


[Lldb-commits] [PATCH] D19214: fix a race is the LLDB test suite results collection

2016-04-17 Thread Todd Fiala via lldb-commits
tfiala created this revision.
tfiala added a reviewer: labath.
tfiala added a subscriber: lldb-commits.

The race boiled down to this:

If a test worker queue is able to run the test inferior and
clean up before the dosep.py listener socket is spun up, and
the worker queue is the last one (as would be the case when
there's only one test rerunning in the rerun queue), then
the test suite will exit the main loop before having a chance
to process any test events coming from the test inferior or
the worker queue job control.

I found this race to be far more likely on fast hardware.
Our Linux CI is one such example.  While it will show
up primarily during meta test events generated by
a worker thread when a test inferior times out or
exits with an exceptional exit (e.g. seg fault), it only
requires that the OS takes longer to hook up the
listener socket than it takes for the final test inferior
and worker thread to shut down.

http://reviews.llvm.org/D19214

Files:
  packages/Python/lldbsuite/test/dosep.py
  packages/Python/lldbsuite/test/dotest_channels.py
  packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py.park
  packages/Python/lldbsuite/test/result_formatter.py

Index: packages/Python/lldbsuite/test/result_formatter.py
===
--- packages/Python/lldbsuite/test/result_formatter.py
+++ packages/Python/lldbsuite/test/result_formatter.py
@@ -76,6 +76,18 @@
 
 sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
 sock.connect(("localhost", port))
+
+# Wait for the ack from the listener side.
+# This is needed to prevent a race condition
+# in the main dosep.py processing loop: we
+# can't allow a worker queue thread to die
+# that has outstanding messages to a listener
+# socket before the listener socket asyncore
+# listener socket gets spun up; otherwise,
+# we lose the test result info.
+read_bytes = sock.recv(1)
+# print("\n** socket creation: received ack: {}".format(ord(read_bytes[0])), file=sys.stderr)
+
 return (sock, lambda: socket_closer(sock))
 
 default_formatter_name = None
Index: packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py.park
===
--- packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py.park
+++ packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py.park
@@ -3,19 +3,21 @@
 
 import time
 
-import lldbsuite.test.lldbtest as lldbtest
+import lldbsuite.test.decorators as decorators
 import rerun_base
 
 
 class RerunTimeoutTestCase(rerun_base.RerunBaseTestCase):
-@lldbtest.no_debug_info_test
+@decorators.no_debug_info_test
 def test_timeout_rerun_succeeds(self):
-"""Tests that timeout logic kicks in and is picked up."""
+"""Tests that the timeout logic kicks in and that this timeout is picked up."""
 if not self.should_generate_issue():
 # We pass this time.
 return
+
 # We time out this time.
 while True:
+# noinspection PyBroadException
 try:
 time.sleep(1)
 except:
Index: packages/Python/lldbsuite/test/dotest_channels.py
===
--- packages/Python/lldbsuite/test/dotest_channels.py
+++ packages/Python/lldbsuite/test/dotest_channels.py
@@ -55,6 +55,14 @@
 # unpickled results.
 raise Exception("forwarding function must be set")
 
+# Initiate all connections by sending an ack.  This allows
+# the initiators of the socket to await this to ensure
+# that this end is up and running (and therefore already
+# into the async map).
+ack_bytes = bytearray()
+ack_bytes.append(chr(42))
+file_object.send(ack_bytes)
+
 def deserialize_payload(self):
 """Unpickles the collected input buffer bytes and forwards."""
 if len(self.ibuffer) > 0:
Index: packages/Python/lldbsuite/test/dosep.py
===
--- packages/Python/lldbsuite/test/dosep.py
+++ packages/Python/lldbsuite/test/dosep.py
@@ -109,13 +109,17 @@
 global GET_WORKER_INDEX
 GET_WORKER_INDEX = get_worker_index_use_pid
 
-def report_test_failure(name, command, output):
+def report_test_failure(name, command, output, timeout):
 global output_lock
 with output_lock:
 if not (RESULTS_FORMATTER and RESULTS_FORMATTER.is_using_terminal()):
 print(file=sys.stderr)
 print(output, file=sys.stderr)
-print("[%s FAILED]" % name, file=sys.stderr)
+if timeout:
+timeout_str = " (TIMEOUT)"
+else:
+timeout_str = ""
+print("[%s