[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.

2018-12-21 Thread Jason Gerlowski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726913#comment-16726913
 ] 

Jason Gerlowski commented on SOLR-13037:


fucit.org hasn't shown any {{branch_7x}} or {{master}} failures for this test 
since the fix went in last week.  So I'm going to mark this as closed.

(There are a few branch_7_6 failures, which makes sense since the fix hasn't 
gone to that branch.  I'm happy to add the fix to that branch as well if anyone 
wants it, but my understanding is that we don't normally do this unless 
the fix is for a production bug.  It might make it marginally easier for anyone 
cutting a theoretical 7.6.1 to get passing builds, which was apparently a 
serious problem with 7.6.  So I've got mixed feelings, but will hold off for 
now.)

> Harden TestSimGenericDistributedQueue.
> --
>
> Key: SOLR-13037
> URL: https://issues.apache.org/jira/browse/SOLR-13037
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Mark Miller
>Assignee: Jason Gerlowski
>Priority: Major
> Fix For: master (8.0), 7.7
>
> Attachments: SOLR-13037.patch, repro-log.txt
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.

2018-12-13 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720060#comment-16720060
 ] 

ASF subversion and git services commented on SOLR-13037:


Commit b072a7d264a3cd45ff83ebb97045d05bb758fefd in lucene-solr's branch 
refs/heads/branch_7x from [~gerlowskija]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b072a7d ]

SOLR-13037: Harden TestSimGenericDistributedQueue


> Harden TestSimGenericDistributedQueue.
> --
>
> Key: SOLR-13037
> URL: https://issues.apache.org/jira/browse/SOLR-13037
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Mark Miller
>Assignee: Jason Gerlowski
>Priority: Major
> Attachments: SOLR-13037.patch, repro-log.txt
>
>







[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.

2018-12-13 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720059#comment-16720059
 ] 

ASF subversion and git services commented on SOLR-13037:


Commit d7ad2f46c366e466b45d129c737ffd3f125cff5a in lucene-solr's branch 
refs/heads/master from [~gerlowskija]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d7ad2f4 ]

SOLR-13037: Harden TestSimGenericDistributedQueue


> Harden TestSimGenericDistributedQueue.
> --
>
> Key: SOLR-13037
> URL: https://issues.apache.org/jira/browse/SOLR-13037
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Mark Miller
>Assignee: Jason Gerlowski
>Priority: Major
> Attachments: SOLR-13037.patch, repro-log.txt
>
>







[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.

2018-12-12 Thread Jason Gerlowski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719102#comment-16719102
 ] 

Jason Gerlowski commented on SOLR-13037:


I've attached a patch which takes approach #2 above.  With it, I haven't seen 
any GDQ test failures, though I'll be more confident with more beasting.  Will 
run some tests in the background the rest of today and then commit tonight if 
things still look good. 

> Harden TestSimGenericDistributedQueue.
> --
>
> Key: SOLR-13037
> URL: https://issues.apache.org/jira/browse/SOLR-13037
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Mark Miller
>Assignee: Jason Gerlowski
>Priority: Major
> Attachments: SOLR-13037.patch, repro-log.txt
>
>







[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.

2018-12-12 Thread Jason Gerlowski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719000#comment-16719000
 ] 

Jason Gerlowski commented on SOLR-13037:


To (hopefully) explain things a little more clearly, here's the race condition 
I think we're running into.  There are a few sections of 
{{TestSimGenericDistributedQueue}} that seem to fail, but let's zoom in on one 
in particular.  Check out {{TestSimDistributedQueue}} lines 73-74:

{code}
 (new QueueChangerThread(dq,1000)).start();
 assertNotNull(dq.peek(15000));
{code}

This test code has two threads of interest. The QueueChangerThread we see 
created here will sleep for one second, and then insert data into the queue. 
Meanwhile the main test thread will wait up to 15 seconds for some data to be 
inserted into the queue. That leaves the read roughly 14 seconds of slack, so 
the insert should always finish in time and the read should always pick it up.

Some more detail on how each queue operation works. First the queue-write 
(i.e. {{offer()}}):
 - [Acquire lock 
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L461]
 - [Create queue entry node and attach it to 
parent|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L324]
 - [Wake up any threads sleeping on the 'changed' 
Condition|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L593]
 - [Release lock 
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L465]
 - [Set data for queue 
entry|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L468]

Now the queue-read. Queue-reading works off of a cache of "known queue entries" 
and most queue-reads are handled from there. But the test failure only occurs 
when we need to refresh this cache and read straight from ZK, so I'll skip the 
cache logic here.
 - [Acquire lock 
'updateLock'|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L186]
 - [loop until we're out of time to 
wait:|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L189]
 ** [look for an element and return if 
non-null|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L190]
 ** [sleep until we receive a wakeup from 'changed' Condition or we time 
out.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L194]
 - [Release lock 
'updateLock'.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L198]
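
For readers who don't want to click through the links, the read loop above can be sketched roughly as follows. This is a simplified illustration and not the actual {{GenericDistributedQueue}} code: the class name {{TimedPeekSketch}}, the {{lookup}} supplier and the {{signalChanged()}} helper are hypothetical, and only the lock and Condition names are borrowed from the steps above.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch of a timed peek: look, sleep on 'changed', look again.
class TimedPeekSketch {
  private final ReentrantLock updateLock = new ReentrantLock();
  private final Condition changed = updateLock.newCondition();

  /** Returns the first non-null value produced by lookup, or null on timeout. */
  <T> T peek(Supplier<T> lookup, long waitMillis) {
    long remainingNanos = TimeUnit.MILLISECONDS.toNanos(waitMillis);
    updateLock.lock();
    try {
      while (true) {
        T element = lookup.get();              // look for an element
        if (element != null) return element;   // return if non-null
        if (remainingNanos <= 0) return null;  // out of time to wait
        // Sleep until a wakeup arrives or the remaining time elapses.
        remainingNanos = changed.awaitNanos(remainingNanos);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();      // preserve the interrupt flag
      return null;
    } finally {
      updateLock.unlock();
    }
  }

  /** Wakes every thread currently sleeping in peek(). */
  void signalChanged() {
    updateLock.lock();
    try {
      changed.signalAll();
    } finally {
      updateLock.unlock();
    }
  }
}
```

The point to notice is that a reader which looks and finds nothing usable depends entirely on a later {{signalChanged()}} call (or the timeout) to look again.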

There's a problem with the queue-write code above.  We wake up threads after 
creating the queue entry, but before it's fully initialized with its data.  
This opens the door to a reader waking up, seeing the entry before its data is 
set, and going back to sleep.  The 'changed' signal has already fired by then, 
so that reader will not wake up again until it times out.
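
To make the interleaving concrete, here is a single-threaded replay of the unlucky ordering. It is only a model under stated assumptions: {{LostWakeupReplay}}, {{Entry}} and {{look()}} are hypothetical names, no real locks or threads are involved, and it assumes the reader gets to look exactly once per signal, in the window between the signal and the data being set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: replays the writer's steps and what a woken reader sees.
class LostWakeupReplay {
  /** Stand-in for a queue entry node; data stays null until the writer sets it. */
  static final class Entry {
    byte[] data;
  }

  /** What a reader finds when it looks at the queue: the head's data, or null. */
  static byte[] look(List<Entry> children) {
    return children.isEmpty() ? null : children.get(0).data;
  }

  /** Returns true if the reader ends up holding the data before its timeout. */
  static boolean readerGetsData(boolean signalAfterSetData) {
    List<Entry> children = new ArrayList<>();

    Entry entry = new Entry();
    children.add(entry);               // writer: create entry, attach to parent
    byte[] seen = look(children);      // writer signals; reader wakes and looks
    // seen is null here, so the reader goes back to sleep.

    entry.data = new byte[] {1};       // writer: set data for the entry
    if (signalAfterSetData) {
      seen = look(children);           // an extra signal lets the reader look again
    }
    return seen != null;               // without it, the reader sleeps until timeout
  }
}
```

In this model {{readerGetsData(false)}} is the failing case from the test, and {{readerGetsData(true)}} corresponds to adding a wakeup after the data is in place.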

There are a few ways we can fix this:
- we could add a {{changed.signalAll()}} call at the end of {{offer()}}, to 
ensure that there's at least 1 wakeup after the data has been fully added.
- we could alter the flow of {{SimDistribStateManager.createData}} so that the 
node is only attached to the tree after its data has been fully initialized
- we could register a Watcher that triggers on "data-changed", similar to how 
we already trigger a watcher on "child-added"
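
As a rough illustration of the second option (initialize first, attach second), a minimal sketch might look like the following. This is not the attached patch and not the real {{SimDistribStateManager}} API: {{PublishAfterInitSketch}}, {{Node}}, {{read()}} and the {{signalChanged}} callback are all hypothetical names chosen for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a node becomes visible only after its data is set.
class PublishAfterInitSketch {
  /** Immutable node, so a reader can never observe it half-built. */
  static final class Node {
    final String name;
    final byte[] data;

    Node(String name, byte[] data) {
      this.name = name;
      this.data = data.clone();
    }
  }

  private final Map<String, Node> children = new ConcurrentHashMap<>();

  void createData(String name, byte[] data, Runnable signalChanged) {
    Node node = new Node(name, data);  // 1. fully initialize the node
    children.put(name, node);          // 2. only then attach it to the parent
    signalChanged.run();               // 3. wake readers; the data is already there
  }

  /** Returns the node's data, or null if no such node is attached. */
  byte[] read(String name) {
    Node node = children.get(name);
    return node == null ? null : node.data;
  }
}
```

With this ordering, any reader woken by the signal finds either no node at all or a node whose data is already present, so a single wakeup is enough.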
 

> Harden TestSimGenericDistributedQueue.
> --
>
> Key: SOLR-13037
> URL: https://issues.apache.org/jira/browse/SOLR-13037
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Mark Miller
>Assignee: Jason Gerlowski
>Priority: Major
> Attachments: repro-log.txt
>
>





[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.

2018-12-12 Thread Jason Gerlowski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718885#comment-16718885
 ] 

Jason Gerlowski commented on SOLR-13037:


I've attached a log file which shows the race condition that causes this to 
occur.  Most of this logging is custom, but it should still be helpful for 
others trying to understand the problem.

> Harden TestSimGenericDistributedQueue.
> --
>
> Key: SOLR-13037
> URL: https://issues.apache.org/jira/browse/SOLR-13037
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Mark Miller
>Assignee: Jason Gerlowski
>Priority: Major
> Attachments: repro-log.txt
>
>







[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.

2018-12-03 Thread Mark Miller (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707546#comment-16707546
 ] 

Mark Miller commented on SOLR-13037:


{noformat}
[beaster] 2> NOTE: reproduce with: ant test 
-Dtestcase=TestSimGenericDistributedQueue -Dtests.method=testDistributedQueue 
-Dtests.seed=B431D93A4D44AC73 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=es-AR -Dtests.timezone=Asia/Taipei -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
 [beaster] [11:14:10.245] FAILURE 15.0s J7 | 
TestSimGenericDistributedQueue.testDistributedQueue 
{seed=[B431D93A4D44AC73:D7218DCCC2633025] #2} <<<
 [beaster] > Throwable #1: java.lang.AssertionError
 [beaster] > at 
__randomizedtesting.SeedInfo.seed([B431D93A4D44AC73:D7218DCCC2633025]:0)
 [beaster] > at org.junit.Assert.fail(Assert.java:92)
 [beaster] > at org.junit.Assert.assertTrue(Assert.java:43)
 [beaster] > at org.junit.Assert.assertNotNull(Assert.java:526)
 [beaster] > at org.junit.Assert.assertNotNull(Assert.java:537)
 [beaster] > at 
org.apache.solr.cloud.autoscaling.sim.TestSimDistributedQueue.testDistributedQueue(TestSimDistributedQueue.java:74)
 [beaster] > at 
org.apache.solr.cloud.autoscaling.sim.TestSimGenericDistributedQueue.testDistributedQueue(TestSimGenericDistributedQueue.java:37)
{noformat}

> Harden TestSimGenericDistributedQueue.
> --
>
> Key: SOLR-13037
> URL: https://issues.apache.org/jira/browse/SOLR-13037
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Mark Miller
>Priority: Major
>



