[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.
[ https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726913#comment-16726913 ] Jason Gerlowski commented on SOLR-13037: fucit.org hasn't shown any {{branch_7x}} or {{master}} failures for this test since the fix went in last week. So I'm going to mark this as closed. (There are a few branch_7_6 failures, which makes sense since the fix hasn't gone to that branch. I'm happy to add the fix to that branch as well if anyone wants it, but my understanding is that we don't normally do this unless the fix is for a production bug. It might make it marginally easier for anyone cutting a theoretical 7.6.1 to get passing builds, which was apparently a serious problem with 7.6. So I've got mixed feelings, but will hold off for now.) > Harden TestSimGenericDistributedQueue. > -- > > Key: SOLR-13037 > URL: https://issues.apache.org/jira/browse/SOLR-13037 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Mark Miller >Assignee: Jason Gerlowski >Priority: Major > Fix For: master (8.0), 7.7 > > Attachments: SOLR-13037.patch, repro-log.txt > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.
[ https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720060#comment-16720060 ] ASF subversion and git services commented on SOLR-13037: Commit b072a7d264a3cd45ff83ebb97045d05bb758fefd in lucene-solr's branch refs/heads/branch_7x from [~gerlowskija] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b072a7d ] SOLR-13037: Harden TestSimGenericDistributedQueue > Harden TestSimGenericDistributedQueue. > -- > > Key: SOLR-13037 > URL: https://issues.apache.org/jira/browse/SOLR-13037 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Mark Miller >Assignee: Jason Gerlowski >Priority: Major > Attachments: SOLR-13037.patch, repro-log.txt
[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.
[ https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720059#comment-16720059 ] ASF subversion and git services commented on SOLR-13037: Commit d7ad2f46c366e466b45d129c737ffd3f125cff5a in lucene-solr's branch refs/heads/master from [~gerlowskija] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d7ad2f4 ] SOLR-13037: Harden TestSimGenericDistributedQueue > Harden TestSimGenericDistributedQueue. > -- > > Key: SOLR-13037 > URL: https://issues.apache.org/jira/browse/SOLR-13037 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Mark Miller >Assignee: Jason Gerlowski >Priority: Major > Attachments: SOLR-13037.patch, repro-log.txt
[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.
[ https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719102#comment-16719102 ] Jason Gerlowski commented on SOLR-13037: I've attached a patch which takes approach #2 above. With it, I haven't seen any GDQ test failures, though I'll be more confident with more beasting. Will run some tests in the background the rest of today and then commit tonight if things still look good. > Harden TestSimGenericDistributedQueue. > -- > > Key: SOLR-13037 > URL: https://issues.apache.org/jira/browse/SOLR-13037 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Mark Miller >Assignee: Jason Gerlowski >Priority: Major > Attachments: SOLR-13037.patch, repro-log.txt
[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.
[ https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719000#comment-16719000 ] Jason Gerlowski commented on SOLR-13037:

To (hopefully) explain things a little more clearly, here's the race condition I think we're running into here. There are a few sections of {{TestSimGenericDistributedQueue}} that seem to fail, but let's zoom in on one in particular. Check out TestSimDistributedQueue lines 73-74:

{code}
(new QueueChangerThread(dq,1000)).start();
assertNotNull(dq.peek(15000));
{code}

This test code has two threads of interest. The QueueChangerThread we see created here will sleep for one second, and then insert data into the queue. Meanwhile the main test thread will wait for some data to be inserted into the queue. Our queue-reading waits a pretty generous amount of time for things to enter the queue, so the insert should always finish in time and the read should always pick it up.

Here's some more detail on how each queue operation works. First the queue-write (i.e. {{offer()}}):

- [Acquire lock 'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L461]
- [Create queue entry node and attach it to parent|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L324]
- [Wake up any threads sleeping on the 'changed' Condition|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L593]
- [Release lock 'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L465]
- [Set data for queue entry|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L468]

Now the queue-read. Queue-reading works off of a cache of "known queue entries" and most queue-reads are handled from there. But the test failure only occurs when we need to refresh this cache and read straight from ZK, so I'll skip the cache logic here.

- [Acquire lock 'updateLock'|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L186]
- [loop until we're out of time to wait:|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L189]
** [look for an element and return if non-null|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L190]
** [sleep until we receive a wakeup from 'changed' Condition or we time out.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L194]
- [Release lock 'updateLock'.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L198]

There's a problem with the queue-write code above. We wake up threads after creating the queue entry, but before it's fully initialized with its data. This opens the door to readers seeing the entry before its data is ready and going back to sleep. The 'changed' signalling has already happened, so any readers that see the entry too early will go back to sleep and not wake up again until timeout.

There are a few ways we can fix this:

- we could add a {{changed.signalAll()}} call at the end of {{offer()}}, to ensure that there's at least 1 wakeup after the data has been fully added.
- we could alter the flow of {{SimDistribStateManager.createData}} so that the node is only attached to the tree after its data has been fully initialized
- we could register a Watcher that triggers on "data-changed", similar to how we already trigger a watcher on "child-added"

> Harden TestSimGenericDistributedQueue. > -- > > Key: SOLR-13037 > URL: https://issues.apache.org/jira/browse/SOLR-13037 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Mark Miller >Assignee: Jason Gerlowski >Priority: Major > Attachments: repro-log.txt
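The race described in the comment above can be illustrated outside of Solr. What follows is only a minimal sketch under stated assumptions, not the actual Solr code: {{Node}}, {{SketchQueue}}, {{offerRacy}}, {{offerFixed}}, {{run}} and the {{emptyChecks}} test hook are hypothetical names, a single lock stands in for both 'multilock' and 'updateLock', and a node whose data is still null is treated as "not readable yet". The read loop follows the check-then-sleep shape listed in the comment, and {{offerFixed}} mirrors the second proposed fix (attach the node only once its data is set).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class LostWakeupSketch {
    /** Hypothetical stand-in for a queue-entry node; not the real Solr class. */
    static final class Node {
        volatile byte[] data; // stays null until the writer sets it
    }

    /** Simplified queue: one lock replaces both 'multilock' and 'updateLock'. */
    static final class SketchQueue {
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition changed = lock.newCondition();
        private final List<Node> children = new ArrayList<>();
        // Test hook only: counts how often a reader looked and found nothing usable.
        private final AtomicInteger emptyChecks = new AtomicInteger();

        /** Racy order: attach and signal first, set the data afterwards. */
        void offerRacy(byte[] payload, Runnable pause) {
            Node node = new Node();
            attachAndSignal(node);
            pause.run();         // window in which a reader sees a node without data
            node.data = payload; // no signal follows, so sleeping readers stay asleep
        }

        /** Second proposed fix: fully initialize the node, then attach and signal. */
        void offerFixed(byte[] payload) {
            Node node = new Node();
            node.data = payload;
            attachAndSignal(node);
        }

        private void attachAndSignal(Node node) {
            lock.lock();
            try {
                children.add(node);
                changed.signalAll();
            } finally {
                lock.unlock();
            }
        }

        /** Check, then sleep on 'changed', until the wait budget is used up. */
        byte[] peek(long waitMillis) throws InterruptedException {
            long nanos = TimeUnit.MILLISECONDS.toNanos(waitMillis);
            lock.lock();
            try {
                while (nanos > 0) {
                    for (Node node : children) {
                        if (node.data != null) return node.data;
                    }
                    emptyChecks.incrementAndGet();
                    nanos = changed.awaitNanos(nanos);
                }
                return null; // timed out; there is no final re-check
            } finally {
                lock.unlock();
            }
        }

        /** Polls (for roughly 2s at most) until the reader came up empty 'count' times. */
        void awaitEmptyChecks(int count) {
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
            try {
                while (emptyChecks.get() < count && deadline - System.nanoTime() > 0) {
                    Thread.sleep(1);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Runs one writer thread against the calling reader; returns what peek() saw. */
    public static byte[] run(boolean fixed, long waitMillis) throws InterruptedException {
        SketchQueue q = new SketchQueue();
        Thread writer = new Thread(() -> {
            q.awaitEmptyChecks(1); // the reader has looked once and parks on 'changed'
            if (fixed) {
                q.offerFixed(new byte[] {1});
            } else {
                // Set the data only after the woken reader has looked again and found none.
                q.offerRacy(new byte[] {1}, () -> q.awaitEmptyChecks(2));
            }
        });
        writer.start();
        byte[] result = q.peek(waitMillis);
        writer.join();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("racy order:  " + (run(false, 300) == null ? "timed out" : "got data"));
        System.out.println("fixed order: " + (run(true, 5000) == null ? "timed out" : "got data"));
    }
}
```

With the racy ordering the reader is expected to time out and return null even though the data arrives well inside the wait budget, because the only wakeup was spent on a node that had no data yet. With the fixed ordering the first wakeup already sees a fully initialized node. The {{pause}} hook exists only to make the interleaving repeatable; in the real test it is just unlucky scheduling.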
[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.
[ https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718885#comment-16718885 ] Jason Gerlowski commented on SOLR-13037: I've attached a log file which shows the race condition that causes this to occur. Most of this logging is custom, but it should still be helpful for others trying to understand the problem. > Harden TestSimGenericDistributedQueue. > -- > > Key: SOLR-13037 > URL: https://issues.apache.org/jira/browse/SOLR-13037 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Mark Miller >Assignee: Jason Gerlowski >Priority: Major > Attachments: repro-log.txt
[jira] [Commented] (SOLR-13037) Harden TestSimGenericDistributedQueue.
[ https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707546#comment-16707546 ] Mark Miller commented on SOLR-13037:

{noformat}
[beaster] 2> NOTE: reproduce with: ant test -Dtestcase=TestSimGenericDistributedQueue -Dtests.method=testDistributedQueue -Dtests.seed=B431D93A4D44AC73 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=es-AR -Dtests.timezone=Asia/Taipei -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[beaster] [11:14:10.245] FAILURE 15.0s J7 | TestSimGenericDistributedQueue.testDistributedQueue {seed=[B431D93A4D44AC73:D7218DCCC2633025] #2} <<<
[beaster] > Throwable #1: java.lang.AssertionError
[beaster] > at __randomizedtesting.SeedInfo.seed([B431D93A4D44AC73:D7218DCCC2633025]:0)
[beaster] > at org.junit.Assert.fail(Assert.java:92)
[beaster] > at org.junit.Assert.assertTrue(Assert.java:43)
[beaster] > at org.junit.Assert.assertNotNull(Assert.java:526)
[beaster] > at org.junit.Assert.assertNotNull(Assert.java:537)
[beaster] > at org.apache.solr.cloud.autoscaling.sim.TestSimDistributedQueue.testDistributedQueue(TestSimDistributedQueue.java:74)
[beaster] > at org.apache.solr.cloud.autoscaling.sim.TestSimGenericDistributedQueue.testDistributedQueue(TestSimGenericDistributedQueue.java:37)
{noformat}

> Harden TestSimGenericDistributedQueue. > -- > > Key: SOLR-13037 > URL: https://issues.apache.org/jira/browse/SOLR-13037 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Mark Miller >Priority: Major