[jira] [Reopened] (SOLR-13622) Add FileStream Streaming Expression
[ https://issues.apache.org/jira/browse/SOLR-13622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man reopened SOLR-13622: - Uwe's jenkins servers weren't being included in my reports for over a month – that's why you didn't see any StreamExpressionTest failures in my reports when you looked last month. in reality Uwe's windows builds started picking up a new type of failure: that file handles are being leaked (and thus the test framework can't close them)... {noformat} [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=StreamExpressionTest -Dtests.seed=607225F2726A5625 -Dtests.slow=true -Dtests.locale=ar-PS -Dtests.timezone=Kwajalein -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 0.00s J1 | StreamExpressionTest (suite) <<< [junit4]> Throwable #1: java.io.IOException: Could not remove the following files (in the order of attempts): [junit4]> C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles\directory1\secondLevel2.txt: java.nio.file.FileSystemException: C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles\directory1\secondLevel2.txt: The process cannot access the file because it is being used by another process. [junit4]> C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles\directory1: java.nio.file.DirectoryNotEmptyException: C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles\directory1 [junit4]> C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles: java.nio.file.DirectoryNotEmptyException: C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles [junit4]> C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2: java.nio.file.DirectoryNotEmptyException: C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2 [junit4]> C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001: java.nio.file.DirectoryNotEmptyException: C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001 [junit4]> C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001: java.nio.file.DirectoryNotEmptyException: C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001 [junit4]>at 
__randomizedtesting.SeedInfo.seed([607225F2726A5625]:0) [junit4]>at org.apache.lucene.util.IOUtils.rm(IOUtils.java:319) [junit4]>at java.base/java.lang.Thread.run(Thread.java:835) {noformat} I've only ever seen {{secondLevel2.txt}} show up as being the problem -- based on how the test works, that suggests _either_ the multi-file usage (ie: {{cat("topLevel1.txt,directory1\secondLevel2.txt")}}) causes its _second_ arg to be leaked, _or_ the single-arg directory usage (ie: {{cat("directory1")}}) causes the last file in the directory to be leaked (*or both*). Skimming the code in CatStream this doesn't seem too surprising -- AFAICT the only time {{currentFileLines}} gets closed is when {{maxLines}} gets exceeded, or when {{allFilesToCrawl.hasNext()}} is true ... if there are no more files to crawl, or a file is 0 bytes (ie: {{currentFileLines.hasNext()}} never returns true) then the current / "last" file will never be closed. > Add FileStream Streaming Expression > --- > > Key: SOLR-13622 > URL: https://issues.apache.org/jira/browse/SOLR-13622 > Project: Solr > Issue Type: New Feature > Components: streaming expressions >Reporter: Joel Bernstein >Assi
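The close-on-exhaustion fix described in the comment above can be sketched as a small standalone class. This is an assumed illustration only, not the actual CatStream code (the class and structure here are hypothetical): the current reader is closed as soon as it runs dry, _before_ checking whether another file is available, so the last file (or a 0-byte file) is never left open.
{code:java}
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the fix shape: close the exhausted reader before
// looking for the next file to crawl, so nothing is leaked at end-of-crawl.
public class MultiFileLineReader implements Closeable {
  private final Iterator<Path> filesToCrawl;
  private BufferedReader current;

  public MultiFileLineReader(List<Path> files) {
    this.filesToCrawl = files.iterator();
  }

  /** Returns the next line across all files, or null when everything is read. */
  public String readLine() throws IOException {
    while (true) {
      if (current != null) {
        String line = current.readLine();
        if (line != null) return line;
        current.close();   // close even if the file was empty
        current = null;
      }
      if (!filesToCrawl.hasNext()) return null;  // nothing is left open here
      current = Files.newBufferedReader(filesToCrawl.next());
    }
  }

  @Override
  public void close() throws IOException {
    if (current != null) current.close();  // safety net for early termination
  }
}
{code}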
[jira] [Commented] (SOLR-13746) Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)
[ https://issues.apache.org/jira/browse/SOLR-13746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926763#comment-16926763 ] Hoss Man commented on SOLR-13746: - thanks steve. > Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs) > -- > > Key: SOLR-13746 > URL: https://issues.apache.org/jira/browse/SOLR-13746 > Project: Solr > Issue Type: Task > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > > I just realized that back in June, there was a miscommunication between > myself & Uwe (and a lack of double checking on my part!) regarding upgrading > the JVM versions on our jenkins machines... > * > [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3calpine.DEB.2.11.1906181434350.23523@tray%3e] > * > [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E] > ...Uwe only updated the JVMs on _his_ policeman jenkins machines - the JVM > used on the _*apache*_ jenkins nodes is still (as of 2019-09-06) > "11.0.1+13-LTS" ... > [https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3689/consoleText] > {noformat} > ... > [java-info] java version "11.0.1" > [java-info] Java(TM) SE Runtime Environment (11.0.1+13-LTS, Oracle > Corporation) > [java-info] Java HotSpot(TM) 64-Bit Server VM (11.0.1+13-LTS, Oracle > Corporation) > ... > {noformat} > This means that even after the changes made in SOLR-12988 to re-enable SSL > testing on java11, all Apache jenkins 'master' builds, (including, AFAICT the > yetus / 'Patch Review' builds) are still SKIPping thousands of tests that use > SSL (either explicitly, or due to randomization) because of the logic in > SSLTestConfig that detects bad JVM versions and prevents confusion/spurious > failures. > We really need to get the jenkins nodes updated to openjdk 11.0.3 or 11.0.4 > ASAP. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9658) Caches should have an optional way to clean if idle for 'x' mins
[ https://issues.apache.org/jira/browse/SOLR-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926186#comment-16926186 ] Hoss Man commented on SOLR-9658: {quote}refactored cache impls to allow inserting synthetic entries, and changed the unit tests to use these methods. It turned out that the management of oldestEntry needs to be improved in all caches when we allow the creation time in more recently added entries to go back... {quote} Ah interesting ... IIUC the existing code (in ConcurrentLFUCache for example) just tracks "lastAccessed" for each cache entry (and "oldest" for the cache as a whole) via an incremented counter across all entries – but now you're using actual NANO_SECOND timestamps. This seems like an "ok" change (the API has never exposed these "lastAccessed" values, correct?) but I just want to double check since you've looked at this & thought about it more than me: do you see any risk here? (ie: please don't let me talk you into an Impl change that's "a bad idea" just because it makes the kind of test I was advocating easier to write) Feedback on other aspects of the patch (all minor and/or nitpicks – in general this all seems solid) ...
* AFAICT there should no longer be any need to modify TimeSource / TestTimeSource since tests no longer use/need advanceMs, correct?
* {{SolrCache.MAX_IDLE_TIME}} doesn't seem to have a name consistent w/the other variables in that interface ... seems like it should be {{SolrCache.MAX_IDLE_TIME_PARAM}} ?
** There are also a couple of places in LFUCache and LRUCache (where other existing {{*_PARAM}} constants are used) that seem to use the string literal {{"maxIdleTime"}} instead of using that new variable.
* IIUC this isn't a mistake, it's a deliberate "clean up" change because the existing code includes this {{put(RAM_BYTES_USED_PARAM, ...)}} twice a few lines apart, correct? ...
{code:java}
-map.put(RAM_BYTES_USED_PARAM, ramBytesUsed());
+map.put("cumulative_idleEvictions", cidleEvictions);
{code}
* Is there any reason not to make these final in both ConcurrentLFUCache & ConcurrentLRUCache?
{code:java}
private TimeSource timeSource = TimeSource.NANO_TIME;
private AtomicLong oldestEntry = new AtomicLong(0L);
{code}
* re: this line in {{TestLFUCache.testMaxIdleTimeEviction}} ...
** {{assertEquals("markAndSweep spurious run", 1, sweepFinished.getCount());}}
** a more thread-safe way to have this type of assertion...
{code:java}
final AtomicLong numSweepsStarted = new AtomicLong(0); // NEW
final CountDownLatch sweepFinished = new CountDownLatch(1);
ConcurrentLRUCache cache = new ConcurrentLRUCache<>(6, 5, 5, 6, false, false, null, IDLE_TIME_SEC) {
  @Override
  public void markAndSweep() {
    numSweepsStarted.incrementAndGet(); // NEW
    super.markAndSweep();
    sweepFinished.countDown();
  }
};
...
assertEquals("markAndSweep spurious runs", 0L, numSweepsStarted.get()); // CHANGED
{code}
** I think that pattern exists in another test as well?
* we need to make sure the javadocs & ref-guide are updated to cover this new option, and be clear to users on how it interacts with other things (ie: that the idle sweep happens before the other sweeps and trumps things like the "entry size" checks) > Caches should have an optional way to clean if idle for 'x' mins > > > Key: SOLR-9658 > URL: https://issues.apache.org/jira/browse/SOLR-9658 > Project: Solr > Issue Type: New Feature >Reporter: Noble Paul >Assignee: Andrzej Bialecki >Priority: Major > Fix For: 8.3 > > Attachments: SOLR-9658.patch, SOLR-9658.patch, SOLR-9658.patch, > SOLR-9658.patch, SOLR-9658.patch, SOLR-9658.patch > > > If a cache is idle for long, it consumes precious memory. It should be > configurable to clear the cache if it was not accessed for 'x' secs. The > cache configuration can have an extra config {{maxIdleTime}} . if we wish it > to be cleaned after 10 mins of inactivity set it to {{maxIdleTime=600}}. > [~dragonsinth] would it be a solution for the memory leak you mentioned? -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
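The semantics under review above -- per-entry nanosecond {{lastAccessed}} stamps and an idle sweep that runs before any size-based eviction -- can be illustrated with a minimal sketch. This is an assumed simplification, not the actual SOLR-9658 patch or Solr's ConcurrentLFU/LRUCache:
{code:java}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of maxIdleTime semantics: entries record a nanosecond
// lastAccessed stamp, and the sweep evicts anything idle longer than the
// limit *before* any size-based eviction logic (not shown) would run.
public class IdleSweepExample<K, V> {
  static final class Entry<V> {
    final V value;
    volatile long lastAccessedNanos;
    Entry(V value, long now) { this.value = value; this.lastAccessedNanos = now; }
  }

  private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
  private final long maxIdleNanos;

  public IdleSweepExample(long maxIdleSeconds) {
    this.maxIdleNanos = maxIdleSeconds * 1_000_000_000L;
  }

  public V get(K key) {
    Entry<V> e = map.get(key);
    if (e == null) return null;
    e.lastAccessedNanos = System.nanoTime();  // touch on access
    return e.value;
  }

  public void put(K key, V value) {
    map.put(key, new Entry<>(value, System.nanoTime()));
  }

  /** Evict idle entries first; size-based eviction would happen afterwards. */
  public void markAndSweep() {
    long cutoff = System.nanoTime() - maxIdleNanos;
    for (Iterator<Map.Entry<K, Entry<V>>> it = map.entrySet().iterator(); it.hasNext(); ) {
      if (it.next().getValue().lastAccessedNanos < cutoff) it.remove();
    }
  }
}
{code}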
[jira] [Commented] (SOLR-13746) Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)
[ https://issues.apache.org/jira/browse/SOLR-13746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925947#comment-16925947 ] Hoss Man commented on SOLR-13746: - bq. ... No idea, there is an issue / mail thread already at ASF about AdoptOpenJDK. ... bq. ... I think we should get Infra involved, at a minimum to ask if we should be managing JDKs on a self-serve basis. ... So i'm not really clear on where we stand now... IIUC in order to upgrade past 11.0.1 (which is broken) we need to use (Adopt)OpenJDK because oracle hasn't made 11.0.(2|3|4) builds available? -- and it sounds like there is an INFRA issue or mail archive thread somewhere about being able to use OpenJDK ... can someone post a link to that? is that a discussion that's being had in public or in private? (even if it's private, can someone post a link to it so folks w/request karma can access it) Is the infra conversation something happening "in the abstract" of "if/when OpenJDK builds can/should be used", or is it concretely about the need for specific projects to switch? ... ie: Has the fact that 11.0.1 is broken and effectively unusable for a lot of Solr testing been mentioned in the context of that discussion? Can/should we be filing an INFRA jira explicitly requesting upgraded JDKs so it's clear there is a demonstrable need? (does such an issue already exist? can someone please link it here?) Finally: is docker available on the jenkins build slaves, because worst case scenario we could tweak our apache jenkins jobs to run inside docker containers that always use the latest AdoptOpenJDK base images, ala... https://github.com/hossman/solr-jenkins-docker-tester bq. Should we also add this note to the JVM bugs page: https://cwiki.apache.org/confluence/display/lucene/JavaBugs#JavaBugs-OracleJava/SunJava/OpenJDKBugs I thought someone already did this in response to an email thread about this general topic a few months ago -- but maybe not? the list of known JVM SSL bugs is well documented in SOLR-12988 -- anyone who wants to take a stab at summarizing that info in the wiki or release notes of Solr is welcome to do so (my focus has been on tests themselves and trying to figure out if there are any other SSL bugs we've overlooked ... something i'm now freaking out about more as i realized none of the apache jenkins jobs have actually been testing SSL) > Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs) > -- > > Key: SOLR-13746 > URL: https://issues.apache.org/jira/browse/SOLR-13746 > Project: Solr > Issue Type: Task > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > > I just realized that back in June, there was a miscommunication between > myself & Uwe (and a lack of double checking on my part!) regarding upgrading > the JVM versions on our jenkins machines... > * > [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3calpine.DEB.2.11.1906181434350.23523@tray%3e] > * > [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E] > ...Uwe only updated the JVMs on _his_ policeman jenkins machines - the JVM > used on the _*apache*_ jenkins nodes is still (as of 2019-09-06) > "11.0.1+13-LTS" ... > [https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3689/consoleText] > {noformat} > ... 
> [java-info] java version "11.0.1" > [java-info] Java(TM) SE Runtime Environment (11.0.1+13-LTS, Oracle > Corporation) > [java-info] Java HotSpot(TM) 64-Bit Server VM (11.0.1+13-LTS, Oracle > Corporation) > ... > {noformat} > This means that even after the changes made in SOLR-12988 to re-enable SSL > testing on java11, all Apache jenkins 'master' builds, (including, AFAICT the > yetus / 'Patch Review' builds) are still SKIPping thousands of tests that use > SSL (either explicitly, or due to randomization) because of the logic in > SSLTestConfig that detects bad JVM versions and prevents confusion/spurious > failures. > We really need to get the jenkins nodes updated to openjdk 11.0.3 or 11.0.4 > ASAP. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
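The kind of "known bad JVM" gate that keeps SSL tests from running on broken JDKs can be sketched roughly as follows. This is a simplified assumption, not the real logic (which lives in Solr's SSLTestConfig and tracks specific SSL bug reports); the bad-version list here is hypothetical:
{code:java}
import java.util.Set;

// Illustrative sketch only: a "known bad JVM" gate built on the
// Java 9+ Runtime.version() API.
public class JvmSslCheckExample {
  // Hypothetical list; 11.0.1 is among the releases discussed above.
  private static final Set<String> KNOWN_BAD = Set.of("11.0.1", "11.0.2");

  public static boolean sslIsSafeToTest() {
    Runtime.Version v = Runtime.version();
    String featureUpdate = v.feature() + "." + v.interim() + "." + v.update();
    return !KNOWN_BAD.contains(featureUpdate);
  }

  public static void main(String[] args) {
    System.out.println("SSL safe on " + Runtime.version() + "? " + sslIsSafeToTest());
  }
}
{code}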
[jira] [Commented] (SOLR-13745) Test should close resources: AtomicUpdateProcessorFactoryTest
[ https://issues.apache.org/jira/browse/SOLR-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925923#comment-16925923 ] Hoss Man commented on SOLR-13745: - bq. ... It'd be nice if failing to close a SolrQueryRequest might be enforced in tests ... I haven't dug into how/where exactly the ObjectTracking logic helps enforce that we're closing things like SolrIndexSearcher, but in theory there isn't any reason it couldn't also enforce that we're closing (Local)SolrQueryRequest objects? ... i think? > Test should close resources: AtomicUpdateProcessorFactoryTest > -- > > Key: SOLR-13745 > URL: https://issues.apache.org/jira/browse/SOLR-13745 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: David Smiley >Assignee: David Smiley >Priority: Minor > Fix For: 8.3 > > > This test hangs after the test runs because there are directory or request > resources (not sure yet) that are not closed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
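The enforcement idea floated above follows a common pattern: record every tracked object at construction, remove it on close, and fail the suite teardown if anything is left over. A simplified stand-in (not Solr's actual ObjectReleaseTracker) might look like this:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in showing how close-tracking could be extended to
// (Local)SolrQueryRequest: track on construction, release on close(),
// and have test teardown fail if anything is still outstanding.
public class ReleaseTrackerExample {
  private static final Map<Object, Exception> TRACKED = new ConcurrentHashMap<>();

  public static void track(Object o) {
    TRACKED.put(o, new Exception("allocation stack for " + o.getClass().getName()));
  }

  public static void release(Object o) {
    TRACKED.remove(o);
  }

  /** Called from test teardown; throws if any tracked object was never released. */
  public static void assertAllReleased() {
    if (!TRACKED.isEmpty()) {
      AssertionError err = new AssertionError(
          "found " + TRACKED.size() + " object(s) that were not released!");
      TRACKED.values().forEach(err::addSuppressed); // allocation stacks for debugging
      TRACKED.clear();
      throw err;
    }
  }
}
{code}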
[jira] [Commented] (SOLR-13745) Test should close resources: AtomicUpdateProcessorFactoryTest
[ https://issues.apache.org/jira/browse/SOLR-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924690#comment-16924690 ] Hoss Man commented on SOLR-13745: - Interesting... David: i suspect the reason these test bugs didn't manifest until after your commits in SOLR-13728 is because the new code you added in that issue causes DistributedUpdateProcessor to now call {{req.getSearcher().count(...)}} – resulting in {{SolrQueryRequestBase.searcherHolder}} getting populated in a way that it wouldn't have been previously for some of the {{LocalSolrQueryRequest}} instances used in this test. As for why it didn't fail when you ran tests before committing SOLR-13728 ... i'm guessing that maybe this is because of SOLR-13747 / SOLR-12988? (I've already confirmed SOLR-13746 is the reason [yetus's patch review build of SOLR-13728|https://builds.apache.org/job/PreCommit-SOLR-Build/543/testReport/] didn't catch this either) > Test should close resources: AtomicUpdateProcessorFactoryTest > -- > > Key: SOLR-13745 > URL: https://issues.apache.org/jira/browse/SOLR-13745 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: David Smiley >Assignee: David Smiley >Priority: Minor > Fix For: 8.3 > > > This test hangs after the test runs because there are directory or request > resources (not sure yet) that are not closed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
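The test-side implication of the above is that any local request whose {{getSearcher()}} is touched must be closed so the held searcher reference is released. A minimal sketch, assuming the standard Solr core APIs (the exact constructor/params usage here is my assumption, not code from the test):
{code:java}
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;

// Sketch: once anything calls req.getSearcher(), the request holds a
// SolrIndexSearcher reference that is only released by req.close(), so
// local requests must always be closed, e.g. in a finally block.
public class CloseRequestExample {
  static long countAll(SolrCore core) throws Exception {
    SolrQueryRequest req = new LocalSolrQueryRequest(core, new ModifiableSolrParams());
    try {
      return req.getSearcher().count(new MatchAllDocsQuery()); // populates searcherHolder
    } finally {
      req.close(); // releases the tracked searcher reference
    }
  }
}
{code}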
[jira] [Updated] (SOLR-13747) 'ant test' should fail on JVM's w/known SSL bugs
[ https://issues.apache.org/jira/browse/SOLR-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13747: Attachment: SOLR-13747.patch Status: Open (was: Open) Some background... In SOLR-12988, during the discussion of re-enabling SSL testing under java11 knowing that some java 11 versions were broken, I made the following comments... {quote} (on the Junit tests side, having assumes around JVM version is fine – because even then it's not a "silent" behavior change, it's explicitly a "test ignored because XYZ") {quote} {quote} if devs are running tests with a broken JVM, then the tests can & should fail ... that's the job of the tests. it's a bad idea to make the tests "hide" the failure by "faking" that things work using a degraded cipher, or skipping SSL completely (yes, i also think mark's changes to SSLTestConfig in December as part of his commit on this issue were a terrible idea as well) ... the *ONLY* thing we should _consider_ allowing tests to change about their behavior if they see a JVM is "broken" is to SKIP ie: assume(SomethingThatIsFalseForTheBrokenJVM) {quote} Ultimately, adding an {{SSLTestConfig.assumeSslIsSafeToTest()}} method seemed better than doing a hard {{fail(..)}} in any test that wanted to use SSL -- particularly once we realized that (at that time) every available version of Java 13 was affected by SSL bugs. {{SKIP}} ing tests (instead of failing outright) meant we could still have jenkins jobs running the latest jdk13-ea available looking for _other_ bugs, w/o getting noise due to known SSL bugs. But the fact that SOLR-13746 slipped through the cracks has caused me to seriously regret that decision -- and led me to wonder: * Do we have committers who are _still_ running {{ant test}} with "bad" JDKs that don't realize thousands of tests are getting skipped? * What if down the road a jenkins node gets rebuilt/reverted to use an older jdk11 version, would anyone notice? The attached patch adds a new {{TestSSLTestConfig.testFailIfUserRunsTestsWithJVMThatHasKnownSSLBugs}} to the {{solr/test-framework}} module that does what its name implies (with an informative message) when it detects that {{SSLTestConfig.assumeSslIsSafeToTest()}} throws an assumption on the current JVM. I considered just replacing {{SSLTestConfig.assumeSslIsSafeToTest()}} with a {{SSLTestConfig.failTheBuildUnlessSslIsSafeToTest()}} but realized that the potential deluge of thousands of test failures that might occur for an aspiring contributor who attempts to run Solr tests w/no idea their JDK is broken could be overwhelming and scare people off before they even begin. A single clear cut error (in addition to thousands of tests being {{SKIP}} ed) seemed more inviting. I should note: It's possible that down the road we will again find ourselves in this situation... bq. ...particularly once we realized that (at that time) every available version of Java 13 was affected by SSL bugs... ...with some future "Java XX", whose every available 'ea' build we recognize as being completely broken for SSL -- but we still want to let jenkins try to look for _other_ bugs w/o the "noise" of this test failing every build. If that day comes, we can update {{SSLTestConfig.assumeSslIsSafeToTest()}} to {{SKIP}} SSL on those JVM builds, and "whitelist" them in {{TestSSLTestConfig.testFailIfUserRunsTestsWithJVMThatHasKnownSSLBugs}}. 
> 'ant test' should fail on JVM's w/known SSL bugs > > > Key: SOLR-13747 > URL: https://issues.apache.org/jira/browse/SOLR-13747 > Project: Solr > Issue Type: Test > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > Attachments: SOLR-13747.patch > > > If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has > known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this > -- so the user knows they should upgrade their JVM (rather than relying on > the user to notice that SSL tests were {{SKIP}} ed) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
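The shape of the attached test can be sketched as follows. This is an assumed simplification, not the actual patch (the stub stands in for the real {{SSLTestConfig}} logic): it re-runs the same assumption check and converts a trip into a single hard failure with an informative message.
{code:java}
import org.junit.Assert;
import org.junit.AssumptionViolatedException;
import org.junit.Test;

// Rough sketch of turning the "SKIP on bad JVMs" assumption into one
// clear failure rather than thousands of silently skipped tests.
public class TestSslJvmSanityExample {
  @Test
  public void testFailIfJvmHasKnownSslBugs() {
    try {
      SSLTestConfigStub.assumeSslIsSafeToTest(); // hypothetical stand-in for SSLTestConfig
    } catch (AssumptionViolatedException e) {
      Assert.fail("Your JVM has known SSL bugs; thousands of SSL tests are being "
          + "SKIPped. Please upgrade your JDK. Details: " + e.getMessage());
    }
  }

  // Stand-in so the sketch compiles; the real logic lives in SSLTestConfig.
  static class SSLTestConfigStub {
    static void assumeSslIsSafeToTest() { /* assume(...) checks would go here */ }
  }
}
{code}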
[jira] [Updated] (SOLR-13747) 'ant test' should fail on JVM's w/known SSL bugs
[ https://issues.apache.org/jira/browse/SOLR-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13747: Description: If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this -- so the user knows they should upgrade their JVM (rather than relying on the user to notice that SSL tests were {{SKIP}} ed) (was: If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this -- so the user knows they should upgrade their JVM (rather than relying on the user to notice that SSL tests were {{SKIP}}ed)) > 'ant test' should fail on JVM's w/known SSL bugs > > > Key: SOLR-13747 > URL: https://issues.apache.org/jira/browse/SOLR-13747 > Project: Solr > Issue Type: Test > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > > If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has > known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this > -- so the user knows they should upgrade their JVM (rather than relying on > the user to notice that SSL tests were {{SKIP}} ed) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-13747) 'ant test' should fail on JVM's w/known SSL bugs
Hoss Man created SOLR-13747: --- Summary: 'ant test' should fail on JVM's w/known SSL bugs Key: SOLR-13747 URL: https://issues.apache.org/jira/browse/SOLR-13747 Project: Solr Issue Type: Test Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this -- so the user knows they should upgrade their JVM (rather than relying on the user to notice that SSL tests were {{SKIP}}ed) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13746) Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)
[ https://issues.apache.org/jira/browse/SOLR-13746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924661#comment-16924661 ] Hoss Man commented on SOLR-13746: - [~thetaphi] / [~steve_rowe] - is this still something you guys have control over, or do we need to get infra involved? > Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs) > -- > > Key: SOLR-13746 > URL: https://issues.apache.org/jira/browse/SOLR-13746 > Project: Solr > Issue Type: Task > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > > I just realized that back in June, there was a miscommunication between > myself & Uwe (and a lack of double checking on my part!) regarding upgrading > the JVM versions on our jenkins machines... > * > [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3calpine.DEB.2.11.1906181434350.23523@tray%3e] > * > [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E] > ...Uwe only updated the JVMs on _his_ policeman jenkins machines - the JVM > used on the _*apache*_ jenkins nodes is still (as of 2019-09-06) > "11.0.1+13-LTS" ... > [https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3689/consoleText] > {noformat} > ... > [java-info] java version "11.0.1" > [java-info] Java(TM) SE Runtime Environment (11.0.1+13-LTS, Oracle > Corporation) > [java-info] Java HotSpot(TM) 64-Bit Server VM (11.0.1+13-LTS, Oracle > Corporation) > ... > {noformat} > This means that even after the changes made in SOLR-12988 to re-enable SSL > testing on java11, all Apache jenkins 'master' builds, (including, AFAICT the > yetus / 'Patch Review' builds) are still SKIPping thousands of tests that use > SSL (either explicitly, or due to randomization) because of the logic in > SSLTestConfig that detects bad JVM versions and prevents confusion/spurious > failures. > We really need to get the jenkins nodes updated to openjdk 11.0.3 or 11.0.4 > ASAP. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-13746) Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)
Hoss Man created SOLR-13746: --- Summary: Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs) Key: SOLR-13746 URL: https://issues.apache.org/jira/browse/SOLR-13746 Project: Solr Issue Type: Task Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man I just realized that back in June, there was a miscommunication between myself & Uwe (and a lack of double checking on my part!) regarding upgrading the JVM versions on our jenkins machines... * [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3calpine.DEB.2.11.1906181434350.23523@tray%3e] * [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E] ...Uwe only updated the JVMs on _his_ policeman jenkins machines - the JVM used on the _*apache*_ jenkins nodes is still (as of 2019-09-06) "11.0.1+13-LTS" ... [https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3689/consoleText] {noformat} ... [java-info] java version "11.0.1" [java-info] Java(TM) SE Runtime Environment (11.0.1+13-LTS, Oracle Corporation) [java-info] Java HotSpot(TM) 64-Bit Server VM (11.0.1+13-LTS, Oracle Corporation) ... {noformat} This means that even after the changes made in SOLR-12988 to re-enable SSL testing on java11, all Apache jenkins 'master' builds, (including, AFAICT the yetus / 'Patch Review' builds) are still SKIPping thousands of tests that use SSL (either explicitly, or due to randomization) because of the logic in SSLTestConfig that detects bad JVM versions and prevents confusion/spurious failures. We really need to get the jenkins nodes updated to openjdk 11.0.3 or 11.0.4 ASAP. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13728) Fail partial updates if it would inadvertently remove nested docs
[ https://issues.apache.org/jira/browse/SOLR-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924463#comment-16924463 ] Hoss Man commented on SOLR-13728: - Huh? No i'm directly referring to Commit c8203e4787b8ad21e1270781ba4e09fd7f3acb00 ... {noformat} hossman@slate:~/lucene/dev [j11] [master] $ git co c8203e4787b8ad21e1270781ba4e09fd7f3acb00 && ant clean && cd solr/core/ && ant test -Dtestcase=AtomicUpdateProcessorFactoryTest ... [junit4] 2> NOTE: Linux 5.0.0-27-generic amd64/AdoptOpenJDK 11.0.4 (64-bit)/cpus=8,threads=2,free=199278080,total=522190848 [junit4] 2> NOTE: All tests run in this JVM: [AtomicUpdateProcessorFactoryTest] [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=AtomicUpdateProcessorFactoryTest -Dtests.seed=9CA837338CB8D055 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=eu-ES -Dtests.timezone=Indian/Kerguelen -Dtests.asserts=true -Dtests.file.encoding=US-ASCII [junit4] ERROR 0.00s | AtomicUpdateProcessorFactoryTest (suite) <<< [junit4]> Throwable #1: java.lang.AssertionError: ObjectTracker found 6 object(s) that were not released!!! [SolrCore, SolrIndexSearcher, MockDirectoryWrapper, MockDirectoryWrapper, SolrIndexSearcher, MockDirectoryWrapper] [junit4]> org.apache.solr.common.util.ObjectReleaseTracker$ObjectTrackerException: org.apache.solr.core.SolrCore [junit4]>at org.apache.solr.common.util.ObjectReleaseTracker.track(ObjectReleaseTracker.java:42) [junit4]>at org.apache.solr.core.SolrCore.(SolrCore.java:1093) ... hossman@slate:~/lucene/dev/solr/core [j11] [c8203e4787b] $ cd ../../ && git co c8203e4787b8ad21e1270781ba4e09fd7f3acb00~1 Previous HEAD position was c8203e4787b SOLR-13728: fail partial updates to child docs when not supported. HEAD is now at 2552986e872 LUCENE-8917: Fix Solr's TestCodecSupport to stop trying to use the now-removed Direct docValues format hossman@slate:~/lucene/dev [j11] [2552986e872] $ ant clean && cd solr/core/ && ant test -Dtestcase=AtomicUpdateProcessorFactoryTest ... common.test: BUILD SUCCESSFUL Total time: 1 minute 10 seconds hossman@slate:~/lucene/dev/solr/core [j11] [2552986e872] $ ant test -Dtestcase=AtomicUpdateProcessorFactoryTest -Dtests.seed=9CA837338CB8D055 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=eu-ES -Dtests.timezone=Indian/Kerguelen -Dtests.asserts=true -Dtests.file.encoding=US-ASCII ... common.test: BUILD SUCCESSFUL Total time: 19 seconds {noformat} > Fail partial updates if it would inadvertently remove nested docs > - > > Key: SOLR-13728 > URL: https://issues.apache.org/jira/browse/SOLR-13728 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: David Smiley >Assignee: David Smiley >Priority: Minor > Fix For: 8.3 > > Attachments: SOLR-13728.patch > > > In SOLR-12638 Solr gained the ability to do partial updates (aka atomic > updates) to nested documents. However this feature only works if the schema > meets certain conditions. We can know we don't support it and fail the > request – what I propose here. This is much friendlier than wiping out > existing documents. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (SOLR-13728) Fail partial updates if it would inadvertently remove nested docs
[ https://issues.apache.org/jira/browse/SOLR-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man reopened SOLR-13728: - these commits appear to be the cause of a 100% failure rate in {{ant test -Dtestcase=AtomicUpdateProcessorFactoryTest}} in recent jenkins builds. the failures reproduce for me on master, regardless of seed or any other jvm options (haven't tested branch_8x yet). the failures relate to tracking of unclosed directories... {noformat} [junit4] 2> 17393 ERROR (coreCloseExecutor-15-thread-1) [x:collection1 ] o.a.s.c.CachingDirectoryFactory Timeout waiting for all directory ref counts to be released - gave up waiting on CachedDir<> [junit4] 2> 17397 ERROR (coreCloseExecutor-15-thread-1) [x:collection1 ] o.a.s.c.CachingDirectoryFactory Error closing directory:org.apache.solr.common.SolrException: Timeout waiting for all directory ref counts to be released - gave up waiting on CachedDir<> [junit4] 2>at org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:178) [junit4] 2>at org.apache.solr.core.SolrCore.close(SolrCore.java:1699) [junit4] 2>at org.apache.solr.core.SolrCores.lambda$close$0(SolrCores.java:139) [junit4] 2>at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [junit4] 2>at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210) [junit4] 2>at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [junit4] 2>at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [junit4] 2>at java.base/java.lang.Thread.run(Thread.java:834) [junit4] 2> [junit4] 2> 17399 ERROR (coreCloseExecutor-15-thread-1) [x:collection1 ] o.a.s.c.SolrCore java.lang.AssertionError: 2 [junit4] 2>at org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:192) [junit4] 2>at org.apache.solr.core.SolrCore.close(SolrCore.java:1699) [junit4] 2>at org.apache.solr.core.SolrCores.lambda$close$0(SolrCores.java:139) [junit4] 2>at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [junit4] 2>at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210) [junit4] 2>at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [junit4] 2>at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [junit4] 2>at java.base/java.lang.Thread.run(Thread.java:834) [junit4] 2> [junit4] 2> 17399 ERROR (coreCloseExecutor-15-thread-1) [x:collection1 ] o.a.s.c.SolrCores Error shutting down core:java.lang.AssertionError: 2 [junit4] 2>at org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:192) [junit4] 2>at org.apache.solr.core.SolrCore.close(SolrCore.java:1699) [junit4] 2>at org.apache.solr.core.SolrCores.lambda$close$0(SolrCores.java:139) [junit4] 2>at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [junit4] 2>at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210) [junit4] 2>at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [junit4] 2>at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [junit4] 2>at java.base/java.lang.Thread.run(Thread.java:834) [junit4] 2> ... 
[junit4] 2> 78497 INFO (SUITE-AtomicUpdateProcessorFactoryTest-seed#[4E875A6AF0417D9C]-worker) [ ] o.a.s.SolrTestCaseJ4 --- Done waiting for tracked resources to be released [junit4] 2> NOTE: test params are: codec=Lucene80, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@917add1), locale=sr-Cyrl-ME, timezone=Canada/Saskatchewan [junit4] 2> NOTE: Linux 5.0.0-27-generic amd64/AdoptOpenJDK 11.0.4 (64-bit)/cpus=8,threads=2,free=407897088,total=522190848 [junit4] 2> NOTE: All tests run in this JVM: [AtomicUpdateProcessorFactoryTest] [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=AtomicUpdateProcessorFactoryTest -Dtests.seed=4E875A6AF0417D9C -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=sr-Cyrl-ME -Dtests.timezone=Canada/Saskatchewan -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 [junit4] ERROR 0.00s | AtomicUpdateProcessorFactoryTest (suite) <<< [junit4]> Throwable #1: java.lang.AssertionError: ObjectTracker found 6 object(s) that were not released!!! [SolrCore, MockDirectoryWrapper,
[jira] [Resolved] (LUCENE-8917) Remove the "Direct" doc-value format
[ https://issues.apache.org/jira/browse/LUCENE-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved LUCENE-8917. -- Resolution: Fixed i think we're all good now -- that looks like the only affected test. > Remove the "Direct" doc-value format > > > Key: LUCENE-8917 > URL: https://issues.apache.org/jira/browse/LUCENE-8917 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: master (9.0) > > > This is the last user of the Legacy*DocValues APIs. Another option would be > to move this format to doc-value iterators, but I don't think it's worth the > effort: let's just remove it in Lucene 9? -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (LUCENE-8917) Remove the "Direct" doc-value format
[ https://issues.apache.org/jira/browse/LUCENE-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man reopened LUCENE-8917: -- this seems to have caused some reliable solr test failures? Example... https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3683/ {noformat} [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestCodecSupport -Dtests.method=testDynamicFieldsDocValuesFormats -Dtests.seed=FA28EF8B1D76D0FE -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=ru -Dtests.timezone=Europe/Tiraspol -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 0.04s J1 | TestCodecSupport.testDynamicFieldsDocValuesFormats <<< [junit4]> Throwable #1: java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.DocValuesFormat with name 'Direct' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [Asserting, Lucene70, Lucene80] [junit4]>at __randomizedtesting.SeedInfo.seed([FA28EF8B1D76D0FE:1AFBB14D0BE866AA]:0) [junit4]>at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:116) [junit4]>at org.apache.lucene.codecs.DocValuesFormat.forName(DocValuesFormat.java:108) [junit4]>at org.apache.solr.core.SchemaCodecFactory$1.getDocValuesFormatForField(SchemaCodecFactory.java:112) [junit4]>at org.apache.lucene.codecs.lucene80.Lucene80Codec$2.getDocValuesFormatForField(Lucene80Codec.java:74) [junit4]>at org.apache.solr.core.TestCodecSupport.testDynamicFieldsDocValuesFormats(TestCodecSupport.java:87) ... {noformat} ...probably just some tests that need to be removed/updated so they no longer try to use Direct as an option? > Remove the "Direct" doc-value format > > > Key: LUCENE-8917 > URL: https://issues.apache.org/jira/browse/LUCENE-8917 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: master (9.0) > > > This is the last user of the Legacy*DocValues APIs. Another option would be > to move this format to doc-value iterators, but I don't think it's worth the > effort: let's just remove it in Lucene 9? -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
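The suggested cleanup amounts to having tests consult the SPI for available formats rather than hard-coding names like 'Direct'. A short sketch of that, assuming only Lucene's public {{DocValuesFormat}} lookup methods:
{code:java}
import java.util.Set;
import org.apache.lucene.codecs.DocValuesFormat;

// Sketch: only reference doc-values formats the current classpath
// actually provides via SPI, instead of hard-coding a removed name.
public class AvailableDvFormatsExample {
  public static void main(String[] args) {
    Set<String> available = DocValuesFormat.availableDocValuesFormats();
    System.out.println("SPI provides: " + available);
    for (String name : available) {
      DocValuesFormat dvf = DocValuesFormat.forName(name); // safe: name came from SPI
      System.out.println(name + " -> " + dvf);
    }
  }
}
{code}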
[jira] [Updated] (SOLR-13741) possible AuditLogger bugs uncovered while hardening AuditLoggerIntegrationTest
[ https://issues.apache.org/jira/browse/SOLR-13741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13741: Attachment: SOLR-13741.patch Status: Open (was: Open) Attaching my patch, note that at the moment this patch only modifies {{AuditLoggerIntegrationTest}} and does not yet address the '#1' comment I made above regarding the 'delay' option on {{CallbackAuditLoggerPlugin}} – there are additional nocommit comments regarding the planned changes for that, but I didn't want to start on those changes until these existing uncertainties were addressed. [~janhoy] : I would greatly appreciate your review here to help clear up the "correct test, bad behavior" vs "correct behavior, bad test" questions. > possible AuditLogger bugs uncovered while hardening AuditLoggerIntegrationTest > -- > > Key: SOLR-13741 > URL: https://issues.apache.org/jira/browse/SOLR-13741 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Attachments: SOLR-13741.patch > > > A while back i saw a weird non-reproducible failure from > AuditLoggerIntegrationTest. When i started reading through that code, 2 > things jumped out at me: > # the way the 'delay' option works is brittle, and makes assumptions about > CPU scheduling that aren't necessarily going to be true (and also suffers > from the problem that Thread.sleep isn't guaranteed to sleep as long as you > ask it to) > # the way the existing {{waitForAuditEventCallbacks(number)}} logic works by > checking the size of a (List) {{buffer}} of received events in a sleep/poll > loop, until it contains at least N items -- but the code that adds items to > that buffer in the async Callback thread runs _before_ the code that updates > other state variables (like the global {{count}} and the patch specific > {{resourceCounts}}) meaning that a test waiting on 3 events could "see" 3 > events added to the buffer, but calling {{assertEquals(3, > receiver.getTotalCount())}} could subsequently fail because that variable > hadn't been updated yet. > #2 was the source of the failures I was seeing, and while a quick fix for > that specific problem would be to update all other state _before_ adding the > event to the buffer, I set out to try and make more general improvements to > the test: > * eliminate the dependency on sleep loops by {{await}}-ing on concurrent data > structures > * harden the assertions made about the expected events received (updating > some test methods that currently just assert the number of events received) > * add new assertions that _only_ the expected events are received. > In the process of doing this, I've found several oddities/discrepancies > between things the test currently claims/asserts, and what *actually* happens > under more rigorous scrutiny/assertions. > I'll attach a patch shortly that has my (in progress) updates and includes > copious nocommits about things that seem suspect. 
the summary of these concerns > is: > * SolrException status codes that do not match what the existing test says > they should (but doesn't assert) > * extra AuditEvents occurring that the existing test does not expect > * AuditEvents for incorrect credentials that do not at all match the expected > AuditEvent in the existing test -- which the current test seems to miss in > its assertions because it's picking up some extra events triggered by > previous requests earlier in the test that just happen to also match the > assertions. > ...it's not clear to me if the test logic is correct and these are "code > bugs" or if the test is faulty. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-13741) possible AuditLogger bugs uncovered while hardening AuditLoggerIntegrationTest
Hoss Man created SOLR-13741: --- Summary: possible AuditLogger bugs uncovered while hardening AuditLoggerIntegrationTest Key: SOLR-13741 URL: https://issues.apache.org/jira/browse/SOLR-13741 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man Assignee: Hoss Man A while back i saw a weird non-reproducible failure from AuditLoggerIntegrationTest. When i started reading through that code, 2 things jumped out at me: # the way the 'delay' option works is brittle, and makes assumptions about CPU scheduling that aren't necessarily going to be true (and also suffers from the problem that Thread.sleep isn't guaranteed to sleep as long as you ask it to) # the way the existing {{waitForAuditEventCallbacks(number)}} logic works by checking the size of a (List) {{buffer}} of received events in a sleep/poll loop, until it contains at least N items -- but the code that adds items to that buffer in the async Callback thread runs _before_ the code that updates other state variables (like the global {{count}} and the patch specific {{resourceCounts}}) meaning that a test waiting on 3 events could "see" 3 events added to the buffer, but calling {{assertEquals(3, receiver.getTotalCount())}} could subsequently fail because that variable hadn't been updated yet. #2 was the source of the failures I was seeing, and while a quick fix for that specific problem would be to update all other state _before_ adding the event to the buffer, I set out to try and make more general improvements to the test: * eliminate the dependency on sleep loops by {{await}}-ing on concurrent data structures * harden the assertions made about the expected events received (updating some test methods that currently just assert the number of events received) * add new assertions that _only_ the expected events are received. In the process of doing this, I've found several oddities/discrepancies between things the test currently claims/asserts, and what *actually* happens under more rigorous scrutiny/assertions. I'll attach a patch shortly that has my (in progress) updates and includes copious nocommits about things that seem suspect. the summary of these concerns is: * SolrException status codes that do not match what the existing test says they should (but doesn't assert) * extra AuditEvents occurring that the existing test does not expect * AuditEvents for incorrect credentials that do not at all match the expected AuditEvent in the existing test -- which the current test seems to miss in its assertions because it's picking up some extra events triggered by previous requests earlier in the test that just happen to also match the assertions. ...it's not clear to me if the test logic is correct and these are "code bugs" or if the test is faulty. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
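Both fixes described above -- updating all other state _before_ publishing an event, and {{await}}-ing on a concurrent structure instead of sleep/polling a plain List -- can be sketched with a hypothetical receiver (assumed structure, not the actual test code):
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: (1) counters are updated *before* the event is published, so a
// waiter that sees the event also sees consistent counts; (2) the test
// awaits on a BlockingQueue with a timeout instead of a sleep/poll loop.
public class AuditEventReceiverExample<E> {
  private final BlockingQueue<E> buffer = new LinkedBlockingQueue<>();
  private final AtomicLong totalCount = new AtomicLong();

  /** Called from the async callback thread. */
  public void onEvent(E event) {
    totalCount.incrementAndGet();  // update other state first...
    buffer.add(event);             // ...then publish the event
  }

  /** Called from the test thread; blocks until an event arrives or times out. */
  public E awaitEvent(long timeoutSeconds) throws InterruptedException {
    E e = buffer.poll(timeoutSeconds, TimeUnit.SECONDS);
    if (e == null) throw new AssertionError("timed out waiting for audit event");
    return e;
  }

  public long getTotalCount() {
    return totalCount.get();
  }
}
{code}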
[jira] [Comment Edited] (SOLR-13709) Race condition on core reload while core is still loading?
[ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921632#comment-16921632 ] Hoss Man edited comment on SOLR-13709 at 9/3/19 6:27 PM: - -Commit 86e8c44be472556c8a905deb338cafa803ee6ee0 in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter- -[ [https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=86e8c44] ]- -SOLR-13709: Fixed distributed grouping when multiple 'fl' params are specified- -(cherry picked from commit 83cd54f80157916b364bb5ebde20a66cbd5d3d93)- EDIT: not actually relevant to this issue, sorry. was (Author: jira-bot): Commit 86e8c44be472556c8a905deb338cafa803ee6ee0 in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=86e8c44 ] SOLR-13709: Fixed distributed grouping when multiple 'fl' params are specified (cherry picked from commit 83cd54f80157916b364bb5ebde20a66cbd5d3d93) > Race condition on core reload while core is still loading? > -- > > Key: SOLR-13709 > URL: https://issues.apache.org/jira/browse/SOLR-13709 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Erick Erickson >Priority: Major > Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt > > > A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that > there may be a race condition when attempting to re-load a SolrCore while the > core is currently in the process of (re)loading that can leave the SolrCore > in an unusable state. > Details to follow... -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-13709) Race condition on core reload while core is still loading?
[ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921619#comment-16921619 ] Hoss Man edited comment on SOLR-13709 at 9/3/19 6:27 PM: - -Commit 83cd54f80157916b364bb5ebde20a66cbd5d3d93 in lucene-solr's branch refs/heads/master from Chris M. Hostetter- -[ [https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=83cd54f] ]- -SOLR-13709: Fixed distributed grouping when multiple 'fl' params are specified- EDIT: not actually relevant to this issue, sorry. was (Author: jira-bot): Commit 83cd54f80157916b364bb5ebde20a66cbd5d3d93 in lucene-solr's branch refs/heads/master from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=83cd54f ] SOLR-13709: Fixed distributed grouping when multiple 'fl' params are specified > Race condition on core reload while core is still loading? > -- > > Key: SOLR-13709 > URL: https://issues.apache.org/jira/browse/SOLR-13709 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Erick Erickson >Priority: Major > Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt > > > A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that > there may be a race condition when attempting to re-load a SolrCore while the > core is currently in the process of (re)loading that can leave the SolrCore > in an unusable state. > Details to follow... -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param
[ https://issues.apache.org/jira/browse/SOLR-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921644#comment-16921644 ] Hoss Man commented on SOLR-13717: - Gah, juggling too many tabs/issues at the same time. Primary commits related to this issue... * master: https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=83cd54f ** 83cd54f80157916b364bb5ebde20a66cbd5d3d93 * branch_8x: https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=86e8c44 ** 86e8c44be472556c8a905deb338cafa803ee6ee0 > Distributed Grouping breaks multi valued 'fl' param > --- > > Key: SOLR-13717 > URL: https://issues.apache.org/jira/browse/SOLR-13717 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Fix For: master (9.0), 8.3 > > Attachments: SOLR-13717.patch, SOLR-13717.patch > > > Co-worker discovered a bug with (distributed) grouping when multiple {{fl}} > params are specified. > {{StoredFieldsShardRequestFactory}} has very (old and) brittle code that > assumes there will be 0 or 1 {{fl}} params in the original request that it > should inspect to see if it needs to append (via string concat) the uniqueKey > field onto in order to collate the returned stored fields into their > respective (grouped) documents -- and then ignores any additional {{fl}} > params that may exist in the original request when it does so. > The net result is that only the uniqueKey field and whatever fields _are_ > specified in the first {{fl}} param specified are fetched from each shard and > ultimately returned. > The only workaround is to replace multiple {{fl}} params with a single {{fl}} > param containing a comma-separated list of the requested fields. > > Bug is trivial to reproduce with {{bin/solr -e cloud -noprompt}} by comparing > these requests which should all be equivalent... > {noformat} > $ bin/post -c gettingstarted -out yes example/exampledocs/books.csv > ... 
> $ curl > 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=author,name,id&q=*:*&group=true&group.field=genre_s' > { > "grouped":{ > "genre_s":{ > "matches":10, > "groups":[{ > "groupValue":"fantasy", > "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[ > { > "id":"0812521390", > "name":["The Black Company"], > "author":["Glen Cook"]}] > }}, > { > "groupValue":"scifi", > "doclist":{"numFound":2,"start":0,"docs":[ > { > "id":"0553293354", > "name":["Foundation"], > "author":["Isaac Asimov"]}] > }}]}}} > $ curl > 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=author&fl=name,id&q=*:*&group=true&group.field=genre_s' > { > "grouped":{ > "genre_s":{ > "matches":10, > "groups":[{ > "groupValue":"fantasy", > "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[ > { > "id":"0812521390", > "author":["Glen Cook"]}] > }}, > { > "groupValue":"scifi", > "doclist":{"numFound":2,"start":0,"docs":[ > { > "id":"0553293354", > "author":["Isaac Asimov"]}] > }}]}}} > $ curl > 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=id&fl=author&fl=name&q=*:*&group=true&group.field=genre_s' > { > "grouped":{ > "genre_s":{ > "matches":10, > "groups":[{ > "groupValue":"fantasy", > "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[ > { > "id":"0553573403"}] > }}, > { > "groupValue":"scifi", > "doclist":{"numFound":2,"start":0,"docs":[ > { > "id":"0553293354"}] > }}]}}} > {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
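The fix shape implied by the description -- inspect _all_ {{fl}} params and append the uniqueKey field only when it is missing -- can be sketched independently of the actual committed patch (the helper and names here are hypothetical):
{code:java}
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch: consider every fl param, not just the first, and only append the
// uniqueKey field if none of them already requests it.
public class FlParamFixExample {
  static String[] ensureUniqueKeyRequested(String[] fls, String uniqueKey) {
    if (fls == null || fls.length == 0) {
      return new String[] { uniqueKey };  // no fl at all: fetch just the key
    }
    Set<String> fields = new LinkedHashSet<>();
    for (String fl : fls) {               // every fl param, not only fls[0]
      for (String f : fl.split(",")) {
        if (!f.trim().isEmpty()) fields.add(f.trim());
      }
    }
    fields.add(uniqueKey);                // no-op if already requested
    return fields.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] fixed = ensureUniqueKeyRequested(new String[] {"author", "name,id"}, "id");
    System.out.println(String.join(",", fixed));  // author,name,id
  }
}
{code}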
[jira] [Updated] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param
[ https://issues.apache.org/jira/browse/SOLR-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13717: Fix Version/s: 8.3 master (9.0) Resolution: Fixed Status: Resolved (was: Patch Available) Christine: thanks for the review, and for catching & fixing my test laziness. much cleaner. > Distributed Grouping breaks multi valued 'fl' param > --- > > Key: SOLR-13717 > URL: https://issues.apache.org/jira/browse/SOLR-13717 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Fix For: master (9.0), 8.3 > > Attachments: SOLR-13717.patch, SOLR-13717.patch > > > Co-worker discovered a bug with (distributed) grouping when multiple {{fl}} > params are specified. > {{StoredFieldsShardRequestFactory}} has very (old and) brittle code that > assumes there will be 0 or 1 {{fl}} params in the original request that it > should inspect to see if it needs to append (via string concat) the uniqueKey > field onto in order to collate the returned stored fields into their > respective (grouped) documents -- and then ignores any additional {{fl}} > params that may exist in the original request when it does so. > The net result is that only the uniqueKey field and whatever fields _are_ > specified in the first {{fl}} param specified are fetched from each shard and > ultimately returned. > The only workaround is to replace multiple {{fl}} params with a single {{fl}} > param containing a comma-separated list of the requested fields. > > Bug is trivial to reproduce with {{bin/solr -e cloud -noprompt}} by comparing > these requests which should all be equivalent... > {noformat} > $ bin/post -c gettingstarted -out yes example/exampledocs/books.csv > ... > $ curl > 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=author,name,id&q=*:*&group=true&group.field=genre_s' > { > "grouped":{ > "genre_s":{ > "matches":10, > "groups":[{ > "groupValue":"fantasy", > "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[ > { > "id":"0812521390", > "name":["The Black Company"], > "author":["Glen Cook"]}] > }}, > { > "groupValue":"scifi", > "doclist":{"numFound":2,"start":0,"docs":[ > { > "id":"0553293354", > "name":["Foundation"], > "author":["Isaac Asimov"]}] > }}]}}} > $ curl > 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=author&fl=name,id&q=*:*&group=true&group.field=genre_s' > { > "grouped":{ > "genre_s":{ > "matches":10, > "groups":[{ > "groupValue":"fantasy", > "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[ > { > "id":"0812521390", > "author":["Glen Cook"]}] > }}, > { > "groupValue":"scifi", > "doclist":{"numFound":2,"start":0,"docs":[ > { > "id":"0553293354", > "author":["Isaac Asimov"]}] > }}]}}} > $ curl > 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=id&fl=author&fl=name&q=*:*&group=true&group.field=genre_s' > { > "grouped":{ > "genre_s":{ > "matches":10, > "groups":[{ > "groupValue":"fantasy", > "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[ > { > "id":"0553573403"}] > }}, > { > "groupValue":"scifi", > "doclist":{"numFound":2,"start":0,"docs":[ > { > "id":"0553293354"}] > }}]}}} > {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13709) Race condition on core reload while core is still loading?
[ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921597#comment-16921597 ] Hoss Man commented on SOLR-13709: - just to be clear, my primary concern when i created this issue was that it was evident from the test failure logs that core reloading (and as erick points out: potentially other core level ops) could occur in a race condition with the core itself loading. my comments about {{SolrCores.getCoreDescriptor(String)}} and if/when/why/how it should block on attempts to access a core by name if/while that core was loading were based *solely* on the existing javadocs for that method. if those javadocs are and have always been wrong, then trying to "fix" that method to match the javadocs isn't necessarily the best solution -- especially if doing so causes lots of other problems. we can always just update the javadocs, making a note of when/why/how the value may be null, and audit the callers to ensure they are accounting for the possibility of null and handling that value in whatever way makes the most sense for the situation (throw NPE, throw a diff exception, fail a command, etc...). i should point out, i have no idea if a "user level" Core RELOAD (or SWAP or UNLOAD) op (ie: something triggered externally via /admin/cores, or via overseer) also has this problem, or already accounts for the possibility that a core may not yet be loaded -- it may simply be that this particular ZkWatcher that is registered by the core to watch the schema is itself broken, and should be checking some more explicit state to block and take no action until the core is fully loaded. As far as testing... [~erickerickson] - it's not really clear to me what/where/how you're currently trying to test this? ... as i mentioned, it's kind of a fluke that TestSolrCLIRunExample triggered this failure at all, and even when it did, it didn't really "fail" in a reliable way that was obviously related to this specific bug. I would suggest that a more robust way to test this would be with a more targeted non-cloud test, using a custom plugin (search handler, component, whatever...) that spins up a background thread to trigger schema updates in ZK (so that the problematic watcher which does a core reload on schema changes will then fire) and then the custom component should "stall" for some amount of time (ideally {{await}}-ing on something instead of an arbitrary sleep, but i haven't thought it through enough to know what exact condition it could await on) to force a delay in the completion of the SolrCore loading. Then your test just tries to initialize a SolrCore with a config that uses this custom plugin, and asserts that the SolrCore initializes fine *AND* that it (eventually) picks up the updated schema (via polling on the schema API?). Make sense?
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
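To make the test suggestion in the comment above concrete, here is a rough sketch of the kind of "stalling" plugin being described. Everything here is illustrative and hypothetical (the class name, the latch wiring, the timeout) -- it is not code from any attached patch, just one way a component could delay SolrCore load completion until a schema change has been pushed:

{code:java}
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.util.plugin.SolrCoreAware;

// Hypothetical test-only component: stalls SolrCore initialization until the
// test's background thread has triggered a schema update in ZK, so the
// schema-watcher fires while the core is still loading.
public class StallingComponent extends SearchComponent implements SolrCoreAware {

  // Counted down by the test once it has pushed the new schema to ZK.
  public static final CountDownLatch SCHEMA_PUSHED = new CountDownLatch(1);

  @Override
  public void inform(SolrCore core) {
    try {
      // Await a real condition instead of an arbitrary sleep, per the comment.
      if (!SCHEMA_PUSHED.await(60, TimeUnit.SECONDS)) {
        throw new RuntimeException("timed out waiting for schema update trigger");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException { /* no-op */ }

  @Override
  public void process(ResponseBuilder rb) throws IOException { /* no-op */ }

  @Override
  public String getDescription() {
    return "test-only component that stalls core loading";
  }
}
{code}

The test would then count down {{SCHEMA_PUSHED}} from its background thread after updating the schema in ZK, and assert both that the core finishes loading and that it eventually reflects the new schema.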
[jira] [Updated] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param
[ https://issues.apache.org/jira/browse/SOLR-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13717: Status: Patch Available (was: Open) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param
[ https://issues.apache.org/jira/browse/SOLR-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13717: Attachment: SOLR-13717.patch Status: Open (was: Open) Attached patch includes a fix and some new test coverage of distributed grouping w/various options, comparing the results when using a single {{fl}} param vs equivalent multivalued {{fl}} params. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param
Hoss Man created SOLR-13717: --- Summary: Distributed Grouping breaks multi valued 'fl' param Key: SOLR-13717 URL: https://issues.apache.org/jira/browse/SOLR-13717 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man Assignee: Hoss Man Co-worker discovered a bug with (distributed) grouping when multiple {{fl}} params are specified. {{StoredFieldsShardRequestFactory}} has very (old and) brittle code that assumes there will be 0 or 1 {{fl}} params in the original request that it should inspect to see if it needs to append (via string concat) the uniqueKey field onto in order to collate the returned stored fields into their respective (grouped) documents -- and then ignores any additional {{fl}} params that may exist in the original request when it does so. The net result is that only the uniqueKey field and whatever fields _are_ specified in the first {{fl}} param are fetched from each shard and ultimately returned. The only workaround is to replace multiple {{fl}} params with a single {{fl}} param containing a comma separated list of the requested fields. Bug is trivial to reproduce with {{bin/solr -e cloud -noprompt}} by comparing these requests, which should all be equivalent... {noformat} $ bin/post -c gettingstarted -out yes example/exampledocs/books.csv ... $ curl 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=author,name,id&q=*:*&group=true&group.field=genre_s' { "grouped":{ "genre_s":{ "matches":10, "groups":[{ "groupValue":"fantasy", "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[ { "id":"0812521390", "name":["The Black Company"], "author":["Glen Cook"]}] }}, { "groupValue":"scifi", "doclist":{"numFound":2,"start":0,"docs":[ { "id":"0553293354", "name":["Foundation"], "author":["Isaac Asimov"]}] }}]}}} $ curl 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=author&fl=name,id&q=*:*&group=true&group.field=genre_s' { "grouped":{ "genre_s":{ "matches":10, "groups":[{ "groupValue":"fantasy", "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[ { "id":"0812521390", "author":["Glen Cook"]}] }}, { "groupValue":"scifi", "doclist":{"numFound":2,"start":0,"docs":[ { "id":"0553293354", "author":["Isaac Asimov"]}] }}]}}} $ curl 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=id&fl=author&fl=name&q=*:*&group=true&group.field=genre_s' { "grouped":{ "genre_s":{ "matches":10, "groups":[{ "groupValue":"fantasy", "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[ { "id":"0553573403"}] }}, { "groupValue":"scifi", "doclist":{"numFound":2,"start":0,"docs":[ { "id":"0553293354"}] }}]}}} {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
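For context on the fix direction: the crux is to read every {{fl}} param via {{SolrParams.getParams(...)}} rather than only the first one via {{get(...)}}, and to append the uniqueKey field as an additional param instead of string-concatenating it onto {{fl[0]}}. A simplified sketch of that idea (illustrative only; the committed change lives in {{StoredFieldsShardRequestFactory}}, and real {{fl}} handling also has to cope with globs, aliases and transformers, which this ignores):

{code:java}
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;

public class MultiFlSketch {
  // Copy ALL fl params to the shard request, appending the uniqueKey field
  // as its own fl param if none of the originals already name it.
  static void copyFlWithUniqueKey(SolrParams orig, ModifiableSolrParams shard, String uniqueKey) {
    String[] fls = orig.getParams(CommonParams.FL); // every fl param, not just the first
    boolean hasKey = false;
    if (fls != null) {
      for (String fl : fls) {
        shard.add(CommonParams.FL, fl); // preserve each original fl verbatim
        for (String field : fl.split(",")) {
          if (uniqueKey.equals(field.trim())) {
            hasKey = true;
          }
        }
      }
    }
    if (!hasKey) {
      shard.add(CommonParams.FL, uniqueKey); // append, don't string-concat onto fl[0]
    }
  }
}
{code}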
[jira] [Commented] (SOLR-13709) Race condition on core reload while core is still loading?
[ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914665#comment-16914665 ] Hoss Man commented on SOLR-13709: - {quote}Is there a possibility that this is happening before CoreContainer.load() is finished? {quote} it's absolutely possible – that's the point i made when i created this issue: bq. ...AFAICT the only way this NPE is possible is if the CoreDescriptor for the original SolrCore is NULL at the time the watcher fires, and the only conceivable way that seems to be possible is if the original SolrCore hadn't completely finished loading. According to the docs of that method it _should_ block until the core is loaded, but the ZkWatcher thread in question -- set by the SolrCore during its own init in order to reload the core if the schema changes -- is calling {{getCoreDescriptor()}} in order to reload the core and getting null. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13709) Race condition on core reload while core is still loading?
[ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13709: Attachment: apache_Lucene-Solr-Tests-8.x_449.log.txt Status: Open (was: Open) I've attached the logs from the jenkins run in question... Interestingly: Even though the logs indicate several problems in trying to reload/unload the SolrCore, the test itself didn't seem to care enough about the state of the collection to notice the problems – the only junit failure recorded was a suite level failure from the ObjectTracker due to unreleased threads/objects. The first sign of trouble in the logs is this WARN from a ZK watcher registered to monitor the schema for changes (in schemaless mode) in order to re-load the SolrCore – it fails with a NullPointerException... {noformat} [junit4] 2> 4309877 WARN (Thread-7591) [n:localhost:38920_solr c:testCloudExamplePrompt s:shard2 r:core_node7 x:testCloudExamplePrompt_shard2_replica_n4 ] o.a.s.c.ZkController listener throws error [junit4] 2> => org.apache.solr.common.SolrException: Unable to reload core [testCloudExamplePrompt_shard2_replica_n6] [junit4] 2>at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1557) [junit4] 2> org.apache.solr.common.SolrException: Unable to reload core [testCloudExamplePrompt_shard2_replica_n6] [junit4] 2>at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1557) ~[java/:?] [junit4] 2>at org.apache.solr.core.SolrCore.lambda$getConfListener$21(SolrCore.java:3099) ~[java/:?] [junit4] 2>at org.apache.solr.cloud.ZkController.lambda$fireEventListeners$14(ZkController.java:2514) ~[java/:?] [junit4] 2>at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191] [junit4] 2> Caused by: java.lang.NullPointerException [junit4] 2>at org.apache.solr.core.CoreDescriptor.<init>(CoreDescriptor.java:172) ~[java/:?] [junit4] 2>at org.apache.solr.core.SolrCore.reload(SolrCore.java:683) ~[java/:?] [junit4] 2>at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1507) ~[java/:?] [junit4] 2>... 3 more {noformat} ...AFAICT the only way this NPE is possible is if the CoreDescriptor for the _original_ SolrCore is NULL at the time the watcher fires, and the only conceivable way that seems to be possible is if the original SolrCore hadn't completely finished loading. Apparently as a result of this failure to reload, a SolrCoreInitializationException is recorded for the core name, and that ultimately causes a fast-failure response when trying to unload the core... 
{noformat} [junit4] 2> 4310314 ERROR (qtp373709619-50629) [n:localhost:38920_solr x:testCloudExamplePrompt_shard2_replica_n6 ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException : Error unregistering core [testCloudExamplePrompt_shard2_replica_n6] from cloud state [junit4] 2>at org.apache.solr.core.CoreContainer.unload(CoreContainer.java:1672) [junit4] 2>at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$1(CoreAdminOperation.java:105) [junit4] 2>at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360) [junit4] 2>at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397) [junit4] 2>at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181) [junit4] 2>at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:200) [junit4] 2>at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:820) [junit4] 2>at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:786) [junit4] 2>at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:546) [junit4] 2>at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:423) [junit4] 2>at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350) [junit4] 2>at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) [junit4] 2>at org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:165) [junit4] 2>at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) [junit4] 2>at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) [junit4] 2>at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) [junit4] 2>at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711) [junit4] 2>at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle
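If it turns out the javadocs are wrong and {{getCoreDescriptor(name)}} may legitimately return null while a core is still loading, then any watcher-driven reload needs a null guard along these lines. This is only a sketch under that assumption, not the actual fix:

{code:java}
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.CoreDescriptor;

class ReloadGuardSketch {
  // Defensive version of the schema-watcher's reload path: take no action
  // (rather than NPE inside CoreContainer.reload) if the core's descriptor
  // is not available yet because the core itself is still (re)loading.
  static void maybeReload(CoreContainer cc, String coreName) {
    CoreDescriptor cd = cc.getCoreDescriptor(coreName);
    if (cd == null) {
      // Core still loading: skip this event; a later event (or an explicit
      // check once loading completes) would have to pick up the change.
      return;
    }
    cc.reload(coreName);
  }
}
{code}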
[jira] [Created] (SOLR-13709) Race condition on core reload while core is still loading?
Hoss Man created SOLR-13709: --- Summary: Race condition on core reload while core is still loading? Key: SOLR-13709 URL: https://issues.apache.org/jira/browse/SOLR-13709 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that there may be a race condition when attempting to re-load a SolrCore while the core is currently in the process of (re)loading that can leave the SolrCore in an unusable state. Details to follow... -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-13701) JWTAuthPlugin calls authenticationFailure (which calls HttpServletResponse.sendError) before updating metrics - breaks tests
[ https://issues.apache.org/jira/browse/SOLR-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-13701. - Fix Version/s: 8.3, master (9.0) Resolution: Fixed -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-13700) Race condition in initializing metrics for new security plugins when security.json is modified
[ https://issues.apache.org/jira/browse/SOLR-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-13700. - Fix Version/s: 8.3, master (9.0) Resolution: Fixed -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13650) Support for named global classloaders
[ https://issues.apache.org/jira/browse/SOLR-13650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910870#comment-16910870 ] Hoss Man commented on SOLR-13650: - broke precommit... https://builds.apache.org/job/Lucene-Solr-Tests-master/3590/ {noformat} [forbidden-apis] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [forbidden-apis] in org.apache.solr.handler.TestContainerReqHandler (TestContainerReqHandler.java:586) [forbidden-apis] Scanned 4366 class file(s) for forbidden API invocations (in 6.99s), 1 error(s). {noformat} > Support for named global classloaders > - > > Key: SOLR-13650 > URL: https://issues.apache.org/jira/browse/SOLR-13650 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Noble Paul >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > {code:json} > curl -X POST -H 'Content-type:application/json' --data-binary ' > { > "add-package": { >"name": "my-package" , > "url" : "http://host:port/url/of/jar";, > "sha512":"" > } > }' http://localhost:8983/api/cluster > {code} > This means that Solr creates a globally accessible classloader with a name > {{my-package}} which contains all the jars of that package. > A component should be able to use the package by using the {{"package" : > "my-package"}}. > eg: > {code:json} > curl -X POST -H 'Content-type:application/json' --data-binary ' > { > "create-searchcomponent": { > "name": "my-searchcomponent" , > "class" : "my.path.to.ClassName", > "package" : "my-package" > } > }' http://localhost:8983/api/c/mycollection/config > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
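The fix for that forbidden-apis error is the standard one: pass an explicit charset instead of relying on the platform default. For example:

{code:java}
import java.nio.charset.StandardCharsets;

class CharsetFixExample {
  static byte[] toBytes(String payload) {
    // forbidden: payload.getBytes() -- silently uses the platform default charset
    return payload.getBytes(StandardCharsets.UTF_8);
  }
}
{code}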
[jira] [Updated] (SOLR-13700) Race condition in initializing metrics for new security plugins when security.json is modified
[ https://issues.apache.org/jira/browse/SOLR-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13700: Attachment: SOLR-13700.patch Status: Open (was: Open) I've updated the patch to move {{pkiAuthenticationPlugin.initializeMetrics(...)}} so that it is called exactly once, immediately after {{new PKIAuthenticationPlugin(...)}}. precommit now passes, i'm still beasting. [~janhoy]: I don't understand this part of your comment... bq. ... Re point 2, your patch deletes the wrong lines for auditloggerPlugin metrics. ... The only lines modified in my patch(es) that mention {{auditloggerPlugin}} are the ones that move the {{auditloggerPlugin.plugin.initializeMetrics(...)}} call into the existing {{initializeAuditloggerPlugin(...)}}, as per the point of this jira. Can you please elaborate on what lines you think were wrong to be deleted/moved? ... ideally with a counter-patch, or a suggested test case demonstrating the problem, so there's no ambiguity as to what you mean? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13701) JWTAuthPlugin calls authenticationFailure (which calls HttpServletResponse.sendError) before updating metrics - breaks tests
[ https://issues.apache.org/jira/browse/SOLR-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909241#comment-16909241 ] Hoss Man commented on SOLR-13701: - the beasting i've done locally so far indicates that between the SOLR-13464 work arounds and the fix in this patch there is no need for the 2s retry... but until we actually remove it, it will be hard to know if it's hiding other bugs - because we have very little visibility into how often jenkins builds are passing *only* because of that retry (test logs aren't kept for tests that PASS, so we can't grep for that log message to try and find other situations/bugs we don't currently know about). -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13701) JWTAuthPlugin calls authenticationFailure (which calls HttpServletResponse.sendError) before updating metrics - breaks tests
[ https://issues.apache.org/jira/browse/SOLR-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13701: Attachment: SOLR-13701.patch Status: Open (was: Open) Attaching a patch that addresses this. I also updated the existing code paths that propagate the request via {{filterChain.doFilter(...)}} to ensure that the associated metrics ( {{numPassThrough}} and/or {{numAuthenticated}} ) are updated *before* {{filterChain.doFilter(...)}} is called, so that they are correct even if a subsequent filter (or ultimately, the SolrCore/RequestHandler) encounters an error or otherwise rejects the request. [~janhoy] - would appreciate it if you could review. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-13701) JWTAuthPlugin calls authenticationFailure (which calls HttpServletResponse.sendError) before updating metrics - breaks tests
Hoss Man created SOLR-13701: --- Summary: JWTAuthPlugin calls authenticationFailure (which calls HttpServletResponse.sendError) before updating metrics - breaks tests Key: SOLR-13701 URL: https://issues.apache.org/jira/browse/SOLR-13701 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man Assignee: Hoss Man The way JWTAuthPlugin is currently implemented, any failures are sent to the remote client (via {{authenticationFailure(...)}}, which calls {{HttpServletResponse.sendError(...)}}) *before* {{JWTAuthPlugin.doAuthenticate(...)}} has a chance to update its metrics (like {{numErrors}} and {{numWrongCredentials}}). This causes a race condition in tests where test threads can: * see an error response/Exception before the server thread has updated metrics (like {{numErrors}} and {{numWrongCredentials}}) * call white box methods like {{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} to assert expected metrics ...all before the server thread has ever gotten around to being able to update the metrics in question. {{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} currently has some {{"First metrics count assert failed, pausing 2s before re-attempt"}} logic, evidently to try and work around this bug, but it's still no guarantee that the server thread will be scheduled before the retry happens. We can/should just fix JWTAuthPlugin to ensure the metrics are updated before {{authenticationFailure(...)}} is called, and then remove the "pausing 2s before re-attempt" logic from {{SolrCloudAuthTestCase}} - between this bug fix and the existing work around for SOLR-13464, there should be absolutely no reason to "retry" reading the metrics. (NOTE: BasicAuthPlugin has a similar {{authenticationFailure(...)}} method that also calls {{HttpServletResponse.sendError(...)}} - but it already (correctly) updates the error/failure metrics *before* calling that method.) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
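Schematically, the fix is just an ordering change: record the outcome in the plugin's metrics before handing the failure to the servlet response. A minimal sketch -- the metric names mirror the ones mentioned above, but the Meter type and the surrounding plumbing are assumptions, not the attached patch:

{code:java}
import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

import com.codahale.metrics.Meter;

class AuthFailureOrderingSketch {
  private final Meter numWrongCredentials = new Meter();

  void rejectBadCredentials(HttpServletResponse response, String msg) throws IOException {
    numWrongCredentials.mark();           // 1) update metrics first...
    authenticationFailure(response, msg); // 2) ...only then send the error,
    // so a client that sees the error response can trust the metrics.
  }

  private void authenticationFailure(HttpServletResponse response, String msg) throws IOException {
    response.sendError(HttpServletResponse.SC_UNAUTHORIZED, msg);
  }
}
{code}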
[jira] [Updated] (SOLR-13700) Race condition in initializing metrics for new security plugins when security.json is modified
[ https://issues.apache.org/jira/browse/SOLR-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13700: Assignee: Hoss Man Attachment: SOLR-13700.patch Status: Open (was: Open) Attaching a patch to address this, with one nocommit regarding something that makes no sense to me... [~ab] - can you please review and sanity check: # that my understanding is correct, and that it's "safe" (and more correct) for the "old" and "new" instances of these plugins to be using the same Metric instances before the "new" plugin replaces the old one # the nocommit comments -- unless i'm missing something, {{reloadSecurityProperties()}} has no business calling {{pkiAuthenticationPlugin.initializeMetrics(...)}}, because {{pkiAuthenticationPlugin}} can never change as a result of reloading the security.json ... so {{pkiAuthenticationPlugin.initializeMetrics(...)}} should be called exactly once (and only once) for its entire lifecycle ... ideally in {{CoreContainer.load()}} immediately after calling {{pkiAuthenticationPlugin = new PKIAuthenticationPlugin(...)}} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-13700) Race condition in initializing metrics for new security plugins when security.json is modified
Hoss Man created SOLR-13700: --- Summary: Race condition in initializing metrics for new security plugins when security.json is modified Key: SOLR-13700 URL: https://issues.apache.org/jira/browse/SOLR-13700 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man When new security plugins are initialized due to remote API requests, there is a delay between "registering" the new plugins for use in methods like {{initializeAuthenticationPlugin()}} (by assigning them to CoreContainer's volatile {{this.authenticationPlugin}} variable) and when the {{initializeMetrics(..)}} method is called on these plugins so that they continue to use the same {{Metric}} instances as the plugins they are replacing. Because these security plugins maintain local references to these Metrics (and don't "get" them from the MetricRegistry every time they need to {{inc()}} them), there is a short race condition window: after a new plugin instance is put into use, but before {{initializeMetrics(..)}} is called on it, the plugin is responsible for accepting/rejecting requests, and records those decisions in {{Metric}} instances that are not registered and subsequently get thrown away (and GCed) once the CoreContainer gets around to calling {{initializeMetrics(..)}} (at which point the plugin starts using the pre-existing metric objects). This has some noticeable impacts on auth tests on CPU constrained jenkins machines (even after putting in place SOLR-13464 work arounds) that make assertions about the metrics recorded. In real world situations, the impact of this bug on users is minor: for a few micro/milli-seconds, requests may come in w/o being counted in the auth metrics -- which may also result in discrepancies between the auth metrics totals and the overall request metrics. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
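In sketch form, the race and the fix direction look like this (types and names are simplified stand-ins; the real code swaps the authentication/authorization/auditlogger plugins on CoreContainer):

{code:java}
import com.codahale.metrics.Counter;

class PluginSwapSketch {
  interface AuthPlugin {
    void initializeMetrics(Counter sharedMetric);
  }

  private volatile AuthPlugin authenticationPlugin;

  void swapIn(AuthPlugin newPlugin, Counter sharedMetric) {
    // BUGGY ordering (the race described above):
    //   this.authenticationPlugin = newPlugin;     // live, but still counting
    //   newPlugin.initializeMetrics(sharedMetric); // into throw-away Metrics
    //
    // Fixed ordering: wire up the shared Metric instances BEFORE publishing
    // the new plugin via the volatile field, so no request is ever handled
    // by a plugin whose metrics will be discarded.
    newPlugin.initializeMetrics(sharedMetric);
    this.authenticationPlugin = newPlugin;
  }
}
{code}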
[jira] [Commented] (SOLR-13696) DimensionalRoutedAliasUpdateProcessorTest / RoutedAliasUpdateProcessorTest failures due to commitWithin/openSearcher delays
[ https://issues.apache.org/jira/browse/SOLR-13696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907644#comment-16907644 ] Hoss Man commented on SOLR-13696: - Gus: can you please take a look at this? based on my assessment, here's the crucial bits of the log.. {noformat} hossman@tray:~/tmp/jenkins/DimensionalRoutedAliasUpdateProcessorTest$ grep testTimeCat__TRA__2019-07-05__CRA__calico thetaphi_Lucene-Solr-8.x-MacOSX_272.log.txt | egrep '(Opening \[Searcher|add=\[21|fq=cat_s:calico|\{\!terms\+f%3Did}21,20.*hits=2)' [junit4] 2> 4476175 INFO (qtp759508539-75005) [n:127.0.0.1:55915_solr c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard1 r:core_node5 x:testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n2 ] o.a.s.s.SolrIndexSearcher Opening [Searcher@9bc49f7[testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n2] main] [junit4] 2> 4476176 INFO (qtp1536738594-75022) [n:127.0.0.1:55916_solr c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard1 r:core_node3 x:testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1 ] o.a.s.s.SolrIndexSearcher Opening [Searcher@5547583d[testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1] main] [junit4] 2> 4476186 INFO (qtp1998715126-75500) [n:127.0.0.1:55917_solr c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node8 x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7 ] o.a.s.s.SolrIndexSearcher Opening [Searcher@3d40f0e1[testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7] main] [junit4] 2> 4476195 INFO (qtp927691752-75020) [n:127.0.0.1:55918_solr c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node6 x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4 ] o.a.s.s.SolrIndexSearcher Opening [Searcher@18c82bc1[testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4] main] [junit4] 2> 4477375 INFO (qtp1998715126-75016) [n:127.0.0.1:55917_solr c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node8 x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7 ] o.a.s.u.p.LogUpdateProcessorFactory [testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7] webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:55918/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4/&wt=javabin&version=2}{add=[21 (1641811095092985856)]} 0 2 [junit4] 2> 4477960 INFO (qtp927691752-75506) [n:127.0.0.1:55918_solr c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node6 x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4 ] o.a.s.u.p.LogUpdateProcessorFactory [testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4] webapp=/solr path=/update params={update.distrib=NONE&df=_text_&alias.update.distrib=TOLEADER&distrib.from=http://127.0.0.1:55918/solr/testTimeCat__TRA__2019-07-02__CRA__calico_shard2_replica_n6/&wt=javabin&version=2&processor=inc}{add=[21 (1641811095092985856)]} 0 590 [junit4] 2> 4477962 INFO (commitScheduler-24384-thread-1) [ ] o.a.s.s.SolrIndexSearcher Opening [Searcher@745b6c94[testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4] main] [junit4] 2> 4478213 INFO (qtp1998715126-75501) [n:127.0.0.1:55917_solr c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node8 x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7 ] o.a.s.c.S.Request [testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7] webapp=/solr path=/select params={q={!terms+f%3Did}21,20&rows=0&wt=javabin&version=2} hits=2 status=0 QTime=13 [junit4] 2> 4478408 INFO (qtp1998715126-75016) 
[n:127.0.0.1:55917_solr c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node8 x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7 ] o.a.s.c.S.Request [testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7] webapp=/solr path=/select params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=516&start=0&fsv=true&fq=cat_s:calico&shard.url=http://127.0.0.1:55917/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7/|http://127.0.0.1:55918/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4/&rows=0&version=2&q=*:*&stats=true&omitHeader=false&NOW=1565753074817&isShard=true&wt=javabin&stats.field=timestamp_dt} hits=0 status=0 QTime=0 [junit4] 2> 4478408 INFO (qtp1536738594-75032) [n:127.0.0.1:55916_solr c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard1 r:core_node3 x:testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1 ] o.a.s.c.S.Request [testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1] webapp=/solr path=/select params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=516&start=0&fsv=true&fq=cat_s:calico&shard.url=http://127.0.0.1:55916/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1/|http://127.0.0.1:55915/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard1_repl
[jira] [Created] (SOLR-13696) DimensionalRoutedAliasUpdateProcessorTest / RoutedAliasUpdateProcessorTest failures due to commitWithin/openSearcher delays
Hoss Man created SOLR-13696: --- Summary: DimensionalRoutedAliasUpdateProcessorTest / RoutedAliasUpdateProcessorTest failures due to commitWithin/openSearcher delays Key: SOLR-13696 URL: https://issues.apache.org/jira/browse/SOLR-13696 Project: Solr Issue Type: Test Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man Assignee: Gus Heck Attachments: thetaphi_Lucene-Solr-8.x-MacOSX_272.log.txt Recent jenkins failure... Build: https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-MacOSX/272/ Java: 64bit/jdk1.8.0 -XX:-UseCompressedOops -XX:+UseParallelGC {noformat} Stack Trace: java.lang.AssertionError: expected:<16> but was:<15> at __randomizedtesting.SeedInfo.seed([DB6DC28D5560B1D2:E295833E1541FDB9]:0) at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:834) at org.junit.Assert.assertEquals(Assert.java:645) at org.junit.Assert.assertEquals(Assert.java:631) at org.apache.solr.update.processor.DimensionalRoutedAliasUpdateProcessorTest.assertCatTimeInvariants(DimensionalRoutedAliasUpdateProcessorTest.java:677) at org.apache.solr.update.processor.DimensionalRoutedAliasUpdateProcessorTest.testTimeCat(DimensionalRoutedAliasUpdateProcessorTest.java:282) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {noformat} Digging into the logs, the problem appears to be in the way the test verifies/assumes docs have been committed. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
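A common way to make such tests deterministic is to stop depending on {{commitWithin}} timing and instead issue an explicit commit that blocks until a new searcher is registered before asserting counts. A sketch, where {{solrClient}} and {{alias}} stand in for the test's own objects:

{code:java}
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;

class ExplicitCommitSketch {
  static void commitAndWait(SolrClient solrClient, String alias)
      throws SolrServerException, IOException {
    // waitFlush=true, waitSearcher=true: only returns once a searcher that
    // reflects the commit is in use, so subsequent queries see the docs.
    solrClient.commit(alias, true, true);
  }
}
{code}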
[jira] [Reopened] (SOLR-13688) Make the bin/solr export command run one thread per shard
[ https://issues.apache.org/jira/browse/SOLR-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man reopened SOLR-13688: - 8x doesn't compile... https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-MacOSX/271/ {noformat} Build Log: [...truncated 12279 lines...] [javac] Compiling 1284 source files to /Users/jenkins/workspace/Lucene-Solr-8.x-MacOSX/solr/build/solr-core/classes/java [javac] /Users/jenkins/workspace/Lucene-Solr-8.x-MacOSX/solr/core/src/java/org/apache/solr/util/ExportTool.java:312: error: cannot infer type arguments for BiConsumer [javac] private BiConsumer bic= new BiConsumer<>() { [javac] ^ [javac] reason: '<>' with anonymous inner classes is not supported in -source 8 [javac] (use -source 9 or higher to enable '<>' with anonymous inner classes) [javac] where T,U are type-variables: [javac] T extends Object declared in interface BiConsumer [javac] U extends Object declared in interface BiConsumer [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 1 error {noformat} > Make the bin/solr export command run one thread per shard > > > Key: SOLR-13688 > URL: https://issues.apache.org/jira/browse/SOLR-13688 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > Fix For: 8.3 > > > This can be run in parallel, with one dedicated thread for each shard and the > (distrib=false) option; this will be the only option. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
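For reference, the source-8 fix is simply to spell out the type arguments on the anonymous class, since diamond with anonymous inner classes requires -source 9 or higher. The {{String, Object}} type arguments below are a guess (the archived javac output has the generics stripped); the real ones are whatever {{ExportTool}} declares:

{code:java}
import java.util.function.BiConsumer;

class DiamondFixExample {
  // Java 8-compatible: explicit type arguments instead of 'new BiConsumer<>()'
  private BiConsumer<String, Object> bic = new BiConsumer<String, Object>() {
    @Override
    public void accept(String key, Object value) {
      // ... whatever the export callback does
    }
  };
}
{code}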
[jira] [Created] (SOLR-13694) IndexSizeEstimator NullPointerException
Hoss Man created SOLR-13694: --- Summary: IndexSizeEstimator NullPointerException Key: SOLR-13694 URL: https://issues.apache.org/jira/browse/SOLR-13694 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man Assignee: Andrzej Bialecki Jenkins found a reproducible seed for triggering an NPE in IndexSizeEstimatorTest. Based on a little experimental tracing i did, this might be a real bug in IndexSizeEstimator? ... it's calling close on StoredFieldsReader instances it gets from the CodecReader -- but AFAICT from the docs/code i'm not certain if it should be doing this. It appears the expectation is that this is direct access to the internal state, which will automatically be closed when the CodecReader is closed. ie: IndexSizeEstimator is closing StoredFieldsReader prematurely, causing it to be unusable on the next iteration. (I didn't dig in far enough to guess if there are other places in the IndexSizeEstimator code that are closing CodecReader internals prematurely as well, or just in this situation ... it's also not clear if this only causes failures because this seed uses SimpleTextCodec, and other codecs are more forgiving -- or if something else about the index(es) generated for this seed is what causes the problem to manifest) http://fucit.org/solr-jenkins-reports/job-data/apache/Lucene-Solr-NightlyTests-master/1928 {noformat} hossman@tray:~/lucene/dev/solr/core [j11] [master] $ git rev-parse HEAD 0291db44bc8e092f7cb2f577f0ac8ab6fa6a5fd7 hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test -Dtestcase=IndexSizeEstimatorTest -Dtests.method=testEstimator -Dtests.seed=23F60434E13D8FD4 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=eo -Dtests.timezone=Atlantic/Madeira -Dtests.asserts=true -Dtests.file.encoding=UTF-8 ... [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=IndexSizeEstimatorTest -Dtests.method=testEstimator -Dtests.seed=23F60434E13D8FD4 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=eo -Dtests.timezone=Atlantic/Madeira -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 0.88s | IndexSizeEstimatorTest.testEstimator <<< [junit4]> Throwable #1: java.lang.NullPointerException [junit4]>at __randomizedtesting.SeedInfo.seed([23F60434E13D8FD4:EC2B6B666D451E64]:0) [junit4]>at org.apache.lucene.codecs.simpletext.SimpleTextStoredFieldsReader.visitDocument(SimpleTextStoredFieldsReader.java:109) [junit4]>at org.apache.solr.handler.admin.IndexSizeEstimator.estimateStoredFields(IndexSizeEstimator.java:513) [junit4]>at org.apache.solr.handler.admin.IndexSizeEstimator.estimate(IndexSizeEstimator.java:198) [junit4]>at org.apache.solr.handler.admin.IndexSizeEstimatorTest.testEstimator(IndexSizeEstimatorTest.java:117) [junit4]>at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit4]>at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [junit4]>at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [junit4]>at java.base/java.lang.reflect.Method.invoke(Method.java:566) [junit4]>at java.base/java.lang.Thread.run(Thread.java:834) {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
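The suspected misuse, in sketch form: the {{StoredFieldsReader}} obtained from {{CodecReader.getFieldsReader()}} is part of the reader's internal state and is closed when the {{CodecReader}} itself is closed, so a caller iterating documents should not close it. (Illustrative only, not the estimator's actual code.)

{code:java}
import java.io.IOException;

import org.apache.lucene.codecs.StoredFieldsReader;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.StoredFieldVisitor;

class StoredFieldsUsageSketch {
  static void visitAll(CodecReader reader, StoredFieldVisitor visitor) throws IOException {
    StoredFieldsReader storedFields = reader.getFieldsReader();
    for (int docID = 0; docID < reader.maxDoc(); docID++) {
      storedFields.visitDocument(docID, visitor);
    }
    // Do NOT call storedFields.close() here: the reader owns it, and closing
    // it prematurely leaves the CodecReader unusable on the next pass.
  }
}
{code}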
[jira] [Updated] (SOLR-13464) no way for external clients to detect when changes to security config have taken effect
[ https://issues.apache.org/jira/browse/SOLR-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13464: Description: The basic functionality of the authorization/authentication REST APIs works by persisting changes to a {{security.json}} file in ZooKeeper which is monitored by every node via a Watcher. When the watchers fire, the affected plugin types are (re)-initialized with the new settings. Since this information is "pulled" from ZK by the nodes, there is a (small) inherent delay between when the REST API is hit by external clients, and when each node learns of the changes. An additional delay exists as the config is "reloaded" to (re)initialize the plugins. Practically speaking these delays have very little impact on a "real" solr cloud cluster, but they can be problematic in test cases -- while the SecurityConfHandler on each node could be used to query the "current" security.json file, it doesn't indicate if/when the plugins identified in the "current" configuration are fully in use. For now, we have a "white box" workaround available for MiniSolrCloudCluster-based tests by comparing the Plugins of each CoreContainer in use before and after making known changes via the API (see commits identified below). This issue exists as a placeholder for future consideration of UX/API improvements making it easier for external clients (w/o "white box" access to solr internals) to know definitively if/when modified security settings take effect. {panel:title=original jira description} I've been investigating some sporadic and hard to reproduce test failures related to authentication in cloud mode, and i *think* (but have not directly verified) that the common cause is that after using one of the {{/admin/auth...}} handlers to update some setting, there is an inherent and unpredictable delay (due to ZK watches) until every node in the cluster has had a chance to (re)load the new configuration and initialize the various security plugins with the new settings. Which means, if a test client does a POST to some node to add/change/remove some authn/authz settings, and then immediately hits the exact same node (or any other node) to test that the effects of those settings exist, there is no guarantee that they will have taken effect yet. {panel} was: I've been investigating some sporadic and hard to reproduce test failures related to authentication in cloud mode, and i *think* (but have not directly verified) that the common cause is that after using one of the {{/admin/auth...}} handlers to update some setting, there is an inherent and unpredictable delay (due to ZK watches) until every node in the cluster has had a chance to (re)load the new configuration and initialize the various security plugins with the new settings. Which means, if a test client does a POST to some node to add/change/remove some authn/authz settings, and then immediately hits the exact same node (or any other node) to test that the effects of those settings exist, there is no guarantee that they will have taken effect yet.
Issue Type: Improvement (was: Bug) Summary: no way for external clients to detect when changes to security config have taken effect (was: Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config) since i was able to come up with a test workaround, i've shifted the type, summary, and description of this Jira to focus on future UX/API improvements for external clients > no way for external clients to detect when changes to security config have > taken effect > --- > > Key: SOLR-13464 > URL: https://issues.apache.org/jira/browse/SOLR-13464 > Project: Solr > Issue Type: Improvement >Reporter: Hoss Man >Priority: Major > > The basic functionality of the authorization/authentication REST APIs works > by persisting changes to a {{security.json}} file in ZooKeeper which is > monitored by every node via a Watcher. When the watchers fire, the affected > plugin types are (re)-initialized with the new settings. > Since this information is "pulled" from ZK by the nodes, there is a (small) > inherent delay between when the REST API is hit by external clients, and when > each node learns of the changes. An additional delay exists as the config is > "reloaded" to (re)initialize the plugins. > Practically speaking these delays have very little impact on a "real" solr > cloud cluster, but they can be problematic in test cases -- while the > SecurityConfHandler on each node could be used to query the "current" > security.json file, it doesn't indicate if/when the plugins identified in the > "current" configuration are fully in use. > For now, we have a "white box" workaround available for
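The "white box" approach described above can be sketched as a test helper along these lines (the class name, timeout, and polling interval are invented; it assumes a MiniSolrCloudCluster test plus the {{JettySolrRunner#getCoreContainer()}} and {{CoreContainer#getAuthenticationPlugin()}} accessors, and the {{JettySolrRunner}} package varies by version):

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.embedded.JettySolrRunner;
import org.apache.solr.security.AuthenticationPlugin;

// Hypothetical test helper: capture each node's AuthenticationPlugin before
// the REST call, then spin until every CoreContainer hands back a
// *different* plugin instance.
public class AuthReloadWaiter {
  public static void waitForAuthPluginChange(List<JettySolrRunner> jettys,
      Map<JettySolrRunner, AuthenticationPlugin> before) throws InterruptedException {
    final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
    for (JettySolrRunner jetty : jettys) {
      // each node re-initializes its plugins from its own ZK watcher, so an
      // identity change of the plugin object is the signal that the reload
      // actually happened on that node
      while (jetty.getCoreContainer().getAuthenticationPlugin() == before.get(jetty)) {
        if (System.nanoTime() > deadline) {
          throw new AssertionError("security config never took effect on " + jetty.getNodeName());
        }
        Thread.sleep(100);
      }
    }
  }
}
{code}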
[jira] [Commented] (SOLR-9658) Caches should have an optional way to clean if idle for 'x' mins
[ https://issues.apache.org/jira/browse/SOLR-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904200#comment-16904200 ] Hoss Man commented on SOLR-9658: * i should have noticed/mentioned this in the last patch but: any method (including your new {{markAndSweepByIdleTime()}}) that expects to be called only when markAndSweepLock is already held should really start with {{assert markAndSweepLock.isHeldByCurrentThread();}} * this patch still seems to modify TestJavaBinCodec unnecessarily? (now that you re-added the backcompat constructor) * i don't really think it's a good idea to add these {{CacheListener}} / {{EvictionListener}} APIs at this point w/o a lot more consideration of their lifecycle / usage ** I know you introduced them in response to my suggestion to add hooks for monitoring in tests, but they don't _currently_ seem more useful in the tests than some of the specific suggestions i made before (more comments on this below) and the APIs don't seem to be thought through enough to be generally useful later w/o a lot of re-working... *** Examples: if the point of creating {{CacheListener}} now is to be able to add more methods/hooks to it later, then why is only {{EvictionListener}} passed down to the {{ConcurrentXXXCache}} impls instead of the entire {{CacheListener}} ? *** And why are there 2 distinct {{EvictionListener}} interfaces, instead of just a common one? ** ... so it would probably be safer/cleaner to avoid adding these APIs now since there are simpler alternatives available for the tests? * Re: "...plus adding support for artificially "advancing" the time" ... this seems overly complex? ** None of the suggestions i made for improving the reliability/coverage of the test require needing to fake the "now" clock: just being able to insert synthetic entries into the cache with artificially old timestamps – which could be done by refactoring out the middle of {{put(...)}} into a new {{putCacheEntry(CacheEntry ... )}} method that would let the (test) caller set an arbitrary {{lastAccessed}} value...
{code:java}
/**
 * Usable by tests to create synthetic cache entries, also called by {@link #put}
 * @lucene.internal
 */
public CacheEntry putCacheEntry(CacheEntry e) {
  CacheEntry oldCacheEntry = map.put(e.key, e);
  int currentSize;
  if (oldCacheEntry == null) {
    currentSize = stats.size.incrementAndGet();
    ramBytes.addAndGet(e.ramBytesUsed() + HASHTABLE_RAM_BYTES_PER_ENTRY); // added key + value + entry
  } else {
    currentSize = stats.size.get();
    ramBytes.addAndGet(-oldCacheEntry.ramBytesUsed());
    ramBytes.addAndGet(e.ramBytesUsed());
  }
  if (islive) {
    stats.putCounter.increment();
  } else {
    stats.nonLivePutCounter.increment();
  }
  return oldCacheEntry;
}
{code}
** ...that way tests could "set up" a cache containing arbitrary entries (of arbitrary size, with arbitrary create/access times that could be from weeks in the past) and then very precisely inspect the results of the cache after calling {{markAndSweep()}} *** or some other new {{triggerCleanupIfNeeded()}} method that can encapsulate all of the existing {{// Check if we need to clear out old entries from the cache ...}} logic currently at the end of {{put()}} * In general, i really think testing of functionality like this should really focus on testing "what exactly happens when markAndSweep() is called on a cache containing a very specific set of values?" independent from "does markAndSweep() get called eventually & automatically if i configure maxIdleTime?"
** the former can be tested w/o the need of any cleanup threads or faking the TimeSource ** the latter can be tested w/o the need of a {{CacheListener}} or {{EvictionListener}} API (or a fake TimeSource) – just create an anonymous subclass of {{ConcurrentXXXCache}} whose markAndSweep() method decrements a CountDownLatch that the test thread is waiting on (see the sketch below) ** isolating the testing of these different concepts not only makes it easier to test more complex aspects of how {{markAndSweep()}} is expected to work (ie: "assert exactly which entries are removed if the sum of the sizes == X == (ramUpperWatermark + Y) but the two smallest entries (whose total size = Y + 1) are the only ones with an accessTime older than the idleTime") but also makes it easier to understand & debug failures down the road -if- _when_ they happen. *** as things stand in your patch, -if- _when_ the "did not evict entries in time" assert (eventually) trips in a future jenkins build, we won't immediately be able to tell (w/o added logging) if that's because of a bug in the {{CleanupThread}} that prevented it from calling {{markAndSweep()}}; or a bug in {{SimTimeSource.advanceMs()}} ; or a bug somewhere in the cache that prevented {{markAndSweep()}} from recognizing those entries were old; or just a heavi
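A minimal sketch of that latch idea (assuming {{markAndSweep()}} is overridable from a subclass -- it may need its visibility widened for this -- and the two-arg upper/lower watermark constructor; the watermarks and timeout are illustrative):

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.solr.util.ConcurrentLRUCache;

public class MarkAndSweepLatchSketch {
  public static void main(String[] args) throws InterruptedException {
    final CountDownLatch swept = new CountDownLatch(1);
    ConcurrentLRUCache<String, String> cache =
        new ConcurrentLRUCache<String, String>(100, 90) {
          @Override
          public void markAndSweep() {
            super.markAndSweep();
            swept.countDown(); // signal the waiting test thread
          }
        };
    // ... configure maxIdleTime + cleanup thread, put() some entries ...
    // then wait with a timeout instead of an arbitrary Thread.sleep():
    if (!swept.await(30, TimeUnit.SECONDS)) {
      throw new AssertionError("cleanup thread never called markAndSweep()");
    }
    cache.destroy();
  }
}
{code}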
[jira] [Commented] (SOLR-13399) compositeId support for shard splitting
[ https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903409#comment-16903409 ] Hoss Man commented on SOLR-13399: - i would assume it's related to the (numSubShards) changes in SplitShardCmd ? At first glance, that code path looks like it's specific to SPLIT_BY_PREFIX, but apparently your previous commit has it defaulting to "true" ? (see SplitShardCmd.java L212) {noformat} $ git show 19ddcfd282f3b9eccc50da83653674e510229960 -- core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java | cat commit 19ddcfd282f3b9eccc50da83653674e510229960 Author: yonik Date: Tue Aug 6 14:09:54 2019 -0400 SOLR-13399: ability to use id field for compositeId histogram diff --git a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java index 4d623be..6c5921e 100644 --- a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java +++ b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java @@ -212,16 +212,14 @@ public class SplitShardCmd implements OverseerCollectionMessageHandler.Cmd { if (message.getBool(CommonAdminParams.SPLIT_BY_PREFIX, true)) { t = timings.sub("getRanges"); -log.info("Requesting split ranges from replica " + parentShardLeader.getName() + " as part of slice " + slice + " of collection " -+ collectionName + " on " + parentShardLeader); - ModifiableSolrParams params = new ModifiableSolrParams(); params.set(CoreAdminParams.ACTION, CoreAdminParams.CoreAdminAction.SPLIT.toString()); params.set(CoreAdminParams.GET_RANGES, "true"); params.set(CommonAdminParams.SPLIT_METHOD, splitMethod.toLower()); params.set(CoreAdminParams.CORE, parentShardLeader.getStr("core")); -int numSubShards = message.getInt(NUM_SUB_SHARDS, DEFAULT_NUM_SUB_SHARDS); -params.set(NUM_SUB_SHARDS, Integer.toString(numSubShards)); +// Only 2 is currently supported +// int numSubShards = message.getInt(NUM_SUB_SHARDS, DEFAULT_NUM_SUB_SHARDS); +// params.set(NUM_SUB_SHARDS, Integer.toString(numSubShards)); { final ShardRequestTracker shardRequestTracker = ocmh.asyncRequestTracker(asyncId); @@ -236,7 +234,7 @@ public class SplitShardCmd implements OverseerCollectionMessageHandler.Cmd { NamedList shardRsp = (NamedList)successes.getVal(0); String splits = (String)shardRsp.get(CoreAdminParams.RANGES); if (splits != null) { - log.info("Resulting split range to be used is " + splits); + log.info("Resulting split ranges to be used: " + splits + " slice=" + slice + " leader=" + parentShardLeader); // change the message to use the recommended split ranges message = message.plus(CoreAdminParams.RANGES, splits); } {noformat} (I could be totally off base though -- i don't really understand 90% of what this test is doing, and the place where it fails doesn't seem to be trying to split into more than 2 subshards, so even if the SplitShardCmd changes i pointed out are buggy, i'm not sure why it would cause this particular failure) > compositeId support for shard splitting > --- > > Key: SOLR-13399 > URL: https://issues.apache.org/jira/browse/SOLR-13399 > Project: Solr > Issue Type: New Feature >Reporter: Yonik Seeley >Assignee: Yonik Seeley >Priority: Major > Fix For: 8.3 > > Attachments: SOLR-13399.patch, SOLR-13399.patch, > SOLR-13399_testfix.patch, SOLR-13399_useId.patch, > ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt > > > Shard splitting does not currently have a way to automatically take into > account 
the actual distribution (number of documents) in each hash bucket > created by using compositeId hashing. > We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* > command that would look at the number of docs sharing each compositeId prefix > and use that to create roughly equal sized buckets by document count rather > than just assuming an equal distribution across the entire hash range. > Like normal shard splitting, we should bias against splitting within hash > buckets unless necessary (since that leads to larger query fanout). Perhaps > this warrants a parameter that would control how much of a size mismatch is > tolerable before resorting to splitting within a bucket. > *allowedSizeDifference*? > To more quickly calculate the number of docs in each bucket, we could index > the prefix in a different field. Iterating over the terms for this field > would quickly give us the number of docs in each (i.e lucene keeps tra
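For anyone following along, the parameter under discussion is the {{splitByPrefix}} flag on the SPLITSHARD Collections API call; a representative invocation (host, collection, and shard names are illustrative) looks like:

{noformat}
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1&splitByPrefix=true
{noformat}

Per the diff above, the commit under discussion makes this behavior the default ({{message.getBool(CommonAdminParams.SPLIT_BY_PREFIX, true)}}), so passing {{splitByPrefix=false}} is what restores the old code path.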
[jira] [Updated] (SOLR-13399) compositeId support for shard splitting
[ https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13399: Attachment: ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt Status: Reopened (was: Reopened) git bisect has identified 19ddcfd282f3b9eccc50da83653674e510229960 as the cause of recent (reproducible) jenkins test failures in ShardSplitTest... https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-NightlyTests-8.x/174/ https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-repro/3507/ (Jenkins found the failures on branch_8x, but i was able to reproduce the same exact seed on master, and used that branch for bisecting. Attaching logs from my local master run.) {noformat} ant test -Dtestcase=ShardSplitTest -Dtests.method=test -Dtests.seed=AE04B5C9BA6E9A4 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=sr-Latn -Dtests.timezone=Etc/GMT-11 -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 {noformat} {noformat} [junit4] FAILURE 273s J2 | ShardSplitTest.test <<< [junit4]> Throwable #1: java.lang.AssertionError: Wrong doc count on shard1_0. See SOLR-5309 expected:<257> but was:<316> [junit4]>at __randomizedtesting.SeedInfo.seed([AE04B5C9BA6E9A4:82B47486355A845C]:0) [junit4]>at org.apache.solr.cloud.api.collections.ShardSplitTest.checkDocCountsAndShardStates(ShardSplitTest.java:1002) [junit4]>at org.apache.solr.cloud.api.collections.ShardSplitTest.splitByUniqueKeyTest(ShardSplitTest.java:794) [junit4]>at org.apache.solr.cloud.api.collections.ShardSplitTest.test(ShardSplitTest.java:111) [junit4]>at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:1082) [junit4]>at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:1054) [junit4]>at java.lang.Thread.run(Thread.java:748) {noformat} > compositeId support for shard splitting > --- > > Key: SOLR-13399 > URL: https://issues.apache.org/jira/browse/SOLR-13399 > Project: Solr > Issue Type: New Feature >Reporter: Yonik Seeley >Assignee: Yonik Seeley >Priority: Major > Fix For: 8.3 > > Attachments: SOLR-13399.patch, SOLR-13399.patch, > SOLR-13399_testfix.patch, SOLR-13399_useId.patch, > ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt > > > Shard splitting does not currently have a way to automatically take into > account the actual distribution (number of documents) in each hash bucket > created by using compositeId hashing. > We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* > command that would look at the number of docs sharing each compositeId prefix > and use that to create roughly equal sized buckets by document count rather > than just assuming an equal distribution across the entire hash range. > Like normal shard splitting, we should bias against splitting within hash > buckets unless necessary (since that leads to larger query fanout). Perhaps > this warrants a parameter that would control how much of a size mismatch is > tolerable before resorting to splitting within a bucket. > *allowedSizeDifference*? > To more quickly calculate the number of docs in each bucket, we could index > the prefix in a different field. Iterating over the terms for this field > would quickly give us the number of docs in each (i.e lucene keeps track of > the doc count for each term already.) Perhaps the implementation could be a > flag on the *id* field... 
something like *indexPrefixes* and poly-fields that > would cause the indexing to be automatically done and alleviate having to > pass in an additional field during indexing and during the call to > *SPLITSHARD*. This whole part is an optimization though and could be split > off into its own issue if desired. > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9658) Caches should have an optional way to clean if idle for 'x' mins
[ https://issues.apache.org/jira/browse/SOLR-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902561#comment-16902561 ] Hoss Man commented on SOLR-9658: * i don't see anything that updates {{oldestEntryNs}} except {{markAndSweepByIdleTime}} ? ** this means that {{markAndSweep()}} may unnecessarily call {{markAndSweepByIdleTime()}} (looping over every entry) even if everything older than the maxIdleTime has already been purged by earlier method calls like {{markAndSweepByCacheSize()}} or {{markAndSweepByRamSize()}} ** off the top of my head, i can't think of an efficient way to "update" {{oldestEntryNs}} in some place like {{postRemoveEntry()}} w/o scanning every cache entry again, but... ** why not move {{markAndSweepByIdleTime()}} _before_ {{markAndSweepByCacheSize()}} and {{markAndSweepByRamSize()}} ? *** since the {{postRemoveEntry()}} calls made as a result of any eviction due to idle time *can* (and already do) efficiently update the results of {{size()}} and {{ramBytesUsed()}} that could potentially save the need for those additional scans of the cache in many situations. * rather than complicating the patch by changing the constructor of the {{CleanupThread}} class(es) to take in the maxIdle values directly, why not read that info from a (new) method on the ConcurrentXXXCache objects already passed to the constructors? ** with some small tweaks to the while loop, the {{wait()}} call could actually read this value dynamically from the cache element, eliminating the need to call {{setRunCleanupThread()}} from inside {{setMaxIdleTime()}} in the event that the value is changed dynamically. *** which is currently broken anyway since {{setRunCleanupThread()}} is currently a No-Op if {{this.runCleanupThread}} is true and {{cleanupThread}} is already non-null. ** assuming {{CleanupThread}} is changed to dynamically read the maxIdleTime directly from the cache, {{setMaxIdleTime()}} could just call {{wakeThread()}} if the new maxIdleTime is less than the old maxIdleTime *** or leave the call to {{setRunCleanupThread()}} as is, but change the {{if (cleanupThread == null)}} condition of {{setRunCleanupThread()}} to have an "else" code path that calls {{wakeThread()}} so it will call {{markAndSweep()}} (with the updated settings) and then re-wait (with the new maxIdleTime) * although not likely to be problematic in practice, you've broken backcompat on the public "ConcurrentXXXCache" class(es) by adding an arg to the constructor. ** i would suggest adding a new constructor instead, and making the old one call the new one with "-1" – if for no other reason than to simplify the touch points / discussion in the patch... ** ie: in order to make this change, you had to modify both {{TestJavaBinCodec}} and {{TemplateUpdateProcessorFactory}} – but you wound up not using a backcompat equivalent value in {{TemplateUpdateProcessorFactory}} so your changes actually modify the behavior of that (end-user-facing class) in an undocumented way (that users can't override, and may actually have some noticeable performance impacts on "put" since that existing usage doesn't involve the cleanup thread) which should be discussed before committing (but are largely unrelated to the goals in this jira) * under no circumstances should we be committing new test code that makes arbitrary {{Thread.sleep(5000)}} calls ** i am willing to say categorically that this approach: DOES. NOT. WORK. 
– and has represented an overwhelming percentage of the root causes of our tests being unreliable *** there is no guarantee the JVM will sleep as long as you ask it to (particularly on virtual hardware) *** there is no guarantee that "background threads/logic" will be scheduled/finished during the "sleep" ** it is far better to add whatever {{@lucene.internal}} methods we need to "hook into" the core code from test code and have white-box / grey-box tests that ensure methods get called when we expect, ex: *** if we want to test that the user-level configuration results in the appropriate values being set on the underlying objects, we should add public getter methods for those values to those classes, and have the test reach into the SolrCore to get those objects and assert the expected results on those methods (NOT just "wait" to see the code run and have the expected side effects) *** if we want to test that {{ConcurrentXXXCache.markAndSweep()}} gets called by the {{CleanupThread}} _eventually_ when maxIdle time is configured even if nothing calls {{wakeThread()}} then we should use a mock/subclass of the ConcurrentXXXCache that overrides {{markAndSweep()}} to set a latch that we can {{await(...)}} on from the test code. *** if we want to test that calls to {{ConcurrentXXXCache.markAndSweep()}} result in items being removed if their {{createTime}} is "too old" then we should add a special internal
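To make the "dynamically read the maxIdleTime" suggestion concrete, here is a rough sketch (the {{getMaxIdleTimeNs()}} accessor and the stand-in interface are hypothetical; this is not the patch's actual CleanupThread):

{code:java}
import java.util.concurrent.TimeUnit;

public class CleanupThreadSketch extends Thread {
  interface IdleAwareCache {
    long getMaxIdleTimeNs(); // hypothetical accessor; <= 0 means "no idle-based cleanup"
    void markAndSweep();
  }

  private final IdleAwareCache cache;
  private volatile boolean stop = false;

  CleanupThreadSketch(IdleAwareCache cache) { this.cache = cache; }

  @Override
  public void run() {
    while (!stop) {
      synchronized (this) {
        if (stop) break;
        try {
          // re-reading the idle time on every cycle means setMaxIdleTime()
          // only needs to notify() this thread, not rebuild it
          long idleNs = cache.getMaxIdleTimeNs();
          // wait(0) blocks until wakeThread()/stopThread() notifies us,
          // which covers the "no maxIdleTime configured" case
          wait(idleNs > 0 ? TimeUnit.NANOSECONDS.toMillis(idleNs) : 0);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
      if (!stop) cache.markAndSweep();
    }
  }

  // setMaxIdleTime(newValue) need only call this when the value shrinks
  synchronized void wakeThread() { notifyAll(); }

  synchronized void stopThread() { stop = true; notifyAll(); }
}
{code}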
[jira] [Reopened] (SOLR-13622) Add FileStream Streaming Expression
[ https://issues.apache.org/jira/browse/SOLR-13622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man reopened SOLR-13622: - StreamExpressionTest.testFileStreamDirectoryCrawl seems to make filesystem-specific assumptions that fail hard on Windows... {noformat} FAILED: org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testFileStreamDirectoryCrawl Error Message: expected: but was: Stack Trace: org.junit.ComparisonFailure: expected: but was: at __randomizedtesting.SeedInfo.seed([92C40A8131F8CF7D:362DC46DFDF7A898]:0) at org.junit.Assert.assertEquals(Assert.java:115) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testFileStreamDirectoryCrawl(StreamExpressionTest.java:3128) {noformat} > Add FileStream Streaming Expression > --- > > Key: SOLR-13622 > URL: https://issues.apache.org/jira/browse/SOLR-13622 > Project: Solr > Issue Type: New Feature > Components: streaming expressions >Reporter: Joel Bernstein >Assignee: Jason Gerlowski >Priority: Major > Fix For: 8.3 > > Attachments: SOLR-13622.patch, SOLR-13622.patch > > > The FileStream will read files from a local filesystem and Stream back each > line of the file as a tuple. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
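The usual culprit in this class of failure is comparing expected paths assembled with hardcoded {{/}} separators against paths the JVM reports using the platform separator. A generic illustration (file names and logic invented, not the test's actual assertions):

{code:java}
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: build expectations with Paths.get / File.separator
// rather than hardcoded "/" strings, so the comparison holds on Windows
// ('\') as well as Unix ('/').
public class PlatformNeutralPathCheck {
  public static void main(String[] args) {
    // what a directory crawl might report, using the platform separator
    String reported = "dir1" + File.separator + "file1.txt";
    // a hardcoded "dir1/file1.txt".equals(reported) check fails on Windows;
    // comparing Path objects is separator-agnostic
    Path expected = Paths.get("dir1", "file1.txt");
    if (!Paths.get(reported).equals(expected)) {
      throw new AssertionError("unexpected file: " + reported);
    }
    System.out.println("paths match on this platform");
  }
}
{code}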
[jira] [Commented] (SOLR-13678) ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent zkCallback thread on props watcher
[ https://issues.apache.org/jira/browse/SOLR-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899236#comment-16899236 ] Hoss Man commented on SOLR-13678: - AFAICT CollectionPropsWatcher isn't used internally by solr anywhere, so this issue will only impact solr clients that explicitly register their own watchers. /cc [~tomasflobbe] & [~prusko] and linking to SOLR-11960 where this was introduced. > ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent > zkCallback thread on props watcher > -- > > Key: SOLR-13678 > URL: https://issues.apache.org/jira/browse/SOLR-13678 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > Attachments: collectionpropswatcher-deadlock-jstack.txt > > > while investigating an (unrelated) test bug in CollectionPropsTest I > discovered a deadlock situation that can occur when calling > {{ZkStateReader.removeCollectionPropsWatcher()}} if a zkCallback thread tries > to concurrently fire the watchers set on the collection props. > {{ZkStateReader.removeCollectionPropsWatcher()}} is itself called when a > {{CollectionPropsWatcher.onStateChanged()}} impl returns "true" -- meaning > that IIUC any usage of {{CollectionPropsWatcher}} could potentially result in > this type of deadlock situation. > {noformat} > "TEST-CollectionPropsTest.testReadWriteCached-seed#[D3C6921874D1CFEB]" #15 > prio=5 os_prio=0 cpu=567.78ms elapsed=682.12s tid=0x7 > fa5e8343800 nid=0x3f61 waiting for monitor entry [0x7fa62d222000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.solr.common.cloud.ZkStateReader.lambda$removeCollectionPropsWatcher$20(ZkStateReader.java:2001) > - waiting to lock <0xe6207500> (a > java.util.concurrent.ConcurrentHashMap) > at > org.apache.solr.common.cloud.ZkStateReader$$Lambda$617/0x0001006c1840.apply(Unknown > Source) > at > java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1932) > - locked <0xeb9156b8> (a > java.util.concurrent.ConcurrentHashMap$Node) > at > org.apache.solr.common.cloud.ZkStateReader.removeCollectionPropsWatcher(ZkStateReader.java:1994) > at > org.apache.solr.cloud.CollectionPropsTest.testReadWriteCached(CollectionPropsTest.java:125) > ... 
> "zkCallback-88-thread-2" #213 prio=5 os_prio=0 cpu=14.06ms elapsed=672.65s > tid=0x7fa6041bf000 nid=0x402f waiting for monitor ent > ry [0x7fa5b8f39000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1923) > - waiting to lock <0xeb9156b8> (a > java.util.concurrent.ConcurrentHashMap$Node) > at > org.apache.solr.common.cloud.ZkStateReader$PropsNotification.(ZkStateReader.java:2262) > at > org.apache.solr.common.cloud.ZkStateReader.notifyPropsWatchers(ZkStateReader.java:2243) > at > org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.refreshAndWatch(ZkStateReader.java:1458) > - locked <0xe6207500> (a > java.util.concurrent.ConcurrentHashMap) > at > org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.process(ZkStateReader.java:1440) > at > org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor.lambda$process$1(SolrZkClient.java:838) > at > org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor$$Lambda$253/0x0001004a4440.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$140/0x000100308c40.run(Unknown > Source) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.3/Thread.java:834) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13678) ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent zkCallback thread on props watcher
[ https://issues.apache.org/jira/browse/SOLR-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13678: Attachment: collectionpropswatcher-deadlock-jstack.txt Status: Open (was: Open) attaching the full jstack output that i captured from observing this during a run of {{CollectionPropsTest.testReadWriteCached}} (ie: the source of the snippet included in the summary) Please note that i captured this threaddump while in the process of testing some unrelated changes to other methods in {{CollectionPropsTest}} -- i believe all of my local changes to that test class at the time this thread dump was captured were to code that appeared farther down in the test file than any line numbers that might be mentioned in this threaddump, so all line numbers should be accurate on master circa ~ 52b5ec8068, but i'm not 100% certain. the key thing to focus on is the line numbers and callstack for the non-test code. i am 100% certain i had no local changes to {{CollectionPropsTest.testReadWriteCached}}, or any non-test code. > ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent > zkCallback thread on props watcher > -- > > Key: SOLR-13678 > URL: https://issues.apache.org/jira/browse/SOLR-13678 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > Attachments: collectionpropswatcher-deadlock-jstack.txt > > > while investigating an (unrelated) test bug in CollectionPropsTest I > discovered a deadlock situation that can occur when calling > {{ZkStateReader.removeCollectionPropsWatcher()}} if a zkCallback thread tries > to concurrently fire the watchers set on the collection props. > {{ZkStateReader.removeCollectionPropsWatcher()}} is itself called when a > {{CollectionPropsWatcher.onStateChanged()}} impl returns "true" -- meaning > that IIUC any usage of {{CollectionPropsWatcher}} could potentially result in > this type of deadlock situation. > {noformat} > "TEST-CollectionPropsTest.testReadWriteCached-seed#[D3C6921874D1CFEB]" #15 > prio=5 os_prio=0 cpu=567.78ms elapsed=682.12s tid=0x7 > fa5e8343800 nid=0x3f61 waiting for monitor entry [0x7fa62d222000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.solr.common.cloud.ZkStateReader.lambda$removeCollectionPropsWatcher$20(ZkStateReader.java:2001) > - waiting to lock <0xe6207500> (a > java.util.concurrent.ConcurrentHashMap) > at > org.apache.solr.common.cloud.ZkStateReader$$Lambda$617/0x0001006c1840.apply(Unknown > Source) > at > java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1932) > - locked <0xeb9156b8> (a > java.util.concurrent.ConcurrentHashMap$Node) > at > org.apache.solr.common.cloud.ZkStateReader.removeCollectionPropsWatcher(ZkStateReader.java:1994) > at > org.apache.solr.cloud.CollectionPropsTest.testReadWriteCached(CollectionPropsTest.java:125) > ... 
> "zkCallback-88-thread-2" #213 prio=5 os_prio=0 cpu=14.06ms elapsed=672.65s > tid=0x7fa6041bf000 nid=0x402f waiting for monitor ent > ry [0x7fa5b8f39000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1923) > - waiting to lock <0xeb9156b8> (a > java.util.concurrent.ConcurrentHashMap$Node) > at > org.apache.solr.common.cloud.ZkStateReader$PropsNotification.(ZkStateReader.java:2262) > at > org.apache.solr.common.cloud.ZkStateReader.notifyPropsWatchers(ZkStateReader.java:2243) > at > org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.refreshAndWatch(ZkStateReader.java:1458) > - locked <0xe6207500> (a > java.util.concurrent.ConcurrentHashMap) > at > org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.process(ZkStateReader.java:1440) > at > org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor.lambda$process$1(SolrZkClient.java:838) > at > org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor$$Lambda$253/0x0001004a4440.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$140/0x000100308c40.run(Unknown > Source) > at > ja
[jira] [Created] (SOLR-13678) ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent zkCallback thread on props watcher
Hoss Man created SOLR-13678: --- Summary: ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent zkCallback thread on props watcher Key: SOLR-13678 URL: https://issues.apache.org/jira/browse/SOLR-13678 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man while investigating an (unrelated) test bug in CollectionPropsTest I discovered a deadlock situation that can occur when calling {{ZkStateReader.removeCollectionPropsWatcher()}} if a zkCallback thread tries to concurrently fire the watchers set on the collection props. {{ZkStateReader.removeCollectionPropsWatcher()}} is itself called when a {{CollectionPropsWatcher.onStateChanged()}} impl returns "true" -- meaning that IIUC any usage of {{CollectionPropsWatcher}} could potentially result in this type of deadlock situation. {noformat} "TEST-CollectionPropsTest.testReadWriteCached-seed#[D3C6921874D1CFEB]" #15 prio=5 os_prio=0 cpu=567.78ms elapsed=682.12s tid=0x7 fa5e8343800 nid=0x3f61 waiting for monitor entry [0x7fa62d222000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.solr.common.cloud.ZkStateReader.lambda$removeCollectionPropsWatcher$20(ZkStateReader.java:2001) - waiting to lock <0xe6207500> (a java.util.concurrent.ConcurrentHashMap) at org.apache.solr.common.cloud.ZkStateReader$$Lambda$617/0x0001006c1840.apply(Unknown Source) at java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1932) - locked <0xeb9156b8> (a java.util.concurrent.ConcurrentHashMap$Node) at org.apache.solr.common.cloud.ZkStateReader.removeCollectionPropsWatcher(ZkStateReader.java:1994) at org.apache.solr.cloud.CollectionPropsTest.testReadWriteCached(CollectionPropsTest.java:125) ... "zkCallback-88-thread-2" #213 prio=5 os_prio=0 cpu=14.06ms elapsed=672.65s tid=0x7fa6041bf000 nid=0x402f waiting for monitor ent ry [0x7fa5b8f39000] java.lang.Thread.State: BLOCKED (on object monitor) at java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1923) - waiting to lock <0xeb9156b8> (a java.util.concurrent.ConcurrentHashMap$Node) at org.apache.solr.common.cloud.ZkStateReader$PropsNotification.(ZkStateReader.java:2262) at org.apache.solr.common.cloud.ZkStateReader.notifyPropsWatchers(ZkStateReader.java:2243) at org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.refreshAndWatch(ZkStateReader.java:1458) - locked <0xe6207500> (a java.util.concurrent.ConcurrentHashMap) at org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.process(ZkStateReader.java:1440) at org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor.lambda$process$1(SolrZkClient.java:838) at org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor$$Lambda$253/0x0001004a4440.run(Unknown Source) at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) at java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$140/0x000100308c40.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628) at java.lang.Thread.run(java.base@11.0.3/Thread.java:834) {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, 
e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
[ https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13664: Resolution: Fixed Fix Version/s: 8.3 master (9.0) Status: Resolved (was: Patch Available) > SolrTestCaseJ4.deleteCore() does not delete/clean dataDir > -- > > Key: SOLR-13664 > URL: https://issues.apache.org/jira/browse/SOLR-13664 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Fix For: master (9.0), 8.3 > > Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch, > SOLR-13664.patch > > > Prior to Solr 8.3, the javadocs for {{SolrTestCaseJ4.deleteCore()}} said that > that method would delete the dataDir used by {{initCore()}} in spite of that > method not actually doing anything to clean up the dataDir for a very long > time (exactly when the bug was introduced is not known) > For that reason, in most solr versions up to and including 8.2, tests that > called combinations of {{initCore()}} / {{deleteCore()}} within a single test > class would see the data from a previous core polluting the data of a newly > introduced core. > As part of this jira, this bug was fixed by updating {{deleteCore()}} to > "reset" the value of the {{initCoreDataDir}} variable to null, so that it > can/will be re-initialized on the next call to either {{initCore()}} or the > lower level {{createCore()}}. > Existing tests that refer to the {{initCoreDataDir}} directly (either before, > or during the lifecycle of an active core managed via {{initCore()}} / > {{deleteCore()}} ) may encounter {{NullPointerExceptions}} on upgrading to > Solr 8.3 as a result of this bug fix. These tests are encouraged to use the > new helper method {{initAndGetDataDir()}} in place of referring to > the (now deprecated) {{initCoreDataDir}} variable directly. > Any existing tests that refer to the {{initCoreDataDir}} directly *after* > calling {{deleteCore()}} with the intention of inspecting the index contents > after shutdown, will need to be modified to preserve the results of calling > {{initAndGetDataDir()}} into a new variable for such introspection – the > actual contents of the directory will not be removed until the full lifecycle > of the test class is complete (see {{LuceneTestCase.createTempDir()}}) > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
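A sketch of the 8.3 test migration described above (the test and file names are illustrative; it assumes the {{initAndGetDataDir()}} helper returning a {{File}}, as described in the issue):

{code:java}
import java.io.File;
import org.apache.solr.SolrTestCaseJ4;
import org.junit.Test;

public class DataDirMigrationSketch extends SolrTestCaseJ4 {
  @Test
  public void testInspectDataDirAfterShutdown() throws Exception {
    // pre-8.3 pattern (now NPE-prone after deleteCore()):
    // File dataDir = initCoreDataDir;

    // 8.3+ pattern: grab the dir up front; the directory stays on disk
    // until the whole test class finishes (LuceneTestCase.createTempDir
    // semantics), so it remains inspectable after deleteCore()
    File dataDir = initAndGetDataDir();
    initCore("solrconfig.xml", "schema.xml");
    // ... index something, then ...
    deleteCore();
    // 'dataDir' is still valid here for inspecting the index contents
    assertTrue(new File(dataDir, "index").exists());
  }
}
{code}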
[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
[ https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13664: Description: Prior to Solr 8.3, the javadocs for {{SolrTestCaseJ4.deleteCore()}} said that that method would delete the dataDir used by {{initCore()}} in spite of that method not actually doing anything to clean up the dataDir for a very long time (exactly when the bug was introduced is not known) For that reason, in most solr versions up to and including 8.2, tests that called combinations of {{initCore()}} / {{deleteCore()}} within a single test class would see the data from a previous core polluting the data of a newly introduced core. As part of this jira, this bug was fixed by updating {{deleteCore()}} to "reset" the value of the {{initCoreDataDir}} variable to null, so that it can/will be re-initialized on the next call to either {{initCore()}} or the lower level {{createCore()}}. Existing tests that refer to the {{initCoreDataDir}} directly (either before, or during the lifecycle of an active core managed via {{initCore()}} / {{deleteCore()}} ) may encounter {{NullPointerExceptions}} on upgrading to Solr 8.3 as a result of this bug fix. These tests are encouraged to use the new helper method {{initAndGetDataDir()}} in place of referring to the (now deprecated) {{initCoreDataDir}} variable directly. Any existing tests that refer to the {{initCoreDataDir}} directly *after* calling {{deleteCore()}} with the intention of inspecting the index contents after shutdown, will need to be modified to preserve the results of calling {{initAndGetDataDir()}} into a new variable for such introspection – the actual contents of the directory will not be removed until the full lifecycle of the test class is complete (see {{LuceneTestCase.createTempDir()}}) was: In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does nothing to delete the dataDir used by the TestHarness The git history is a bit murky, so i'm not entirely certain when this stopped working, but I suspect it happened as part of the overall cleanup regarding test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> TestRuleTemporaryFilesCleanup}} While this is not problematic in many test classes, where a single {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses that SolrCore for all test methods and relies on {{@AfterClass SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's problematic in test classes where {{deleteCore()}} is explicitly called in an {{@After}} method to ensure a unique core (w/unique dataDir) is used for each test method. (there are currently about 61 tests that call {{deleteCore()}} directly) updated jira summary to be more helpful to users who may find this jira via CHANGES.txt pointer and need more information on how it affects them if they have their own custom tests using SolrTestCaseJ4. > SolrTestCaseJ4.deleteCore() does not delete/clean dataDir > -- > > Key: SOLR-13664 > URL: https://issues.apache.org/jira/browse/SOLR-13664 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. 
Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch, > SOLR-13664.patch > > > Prior to Solr 8.3, the javadocs for {{SolrTestCaseJ4.deleteCore()}} said that > that method would delete the dataDir used by {{initCore()}} in spite of that > method not actually doing anything to clean up the dataDir for a very long > time (exactly when the bug was introduced is not known) > For that reason, in most solr versions up to and including 8.2, tests that > called combinations of {{initCore()}} / {{deleteCore()}} within a single test > class would see the data from a previous core polluting the data of a newly > introduced core. > As part of this jira, this bug was fixed by updating {{deleteCore()}} to > "reset" the value of the {{initCoreDataDir}} variable to null, so that it > can/will be re-initialized on the next call to either {{initCore()}} or the > lower level {{createCore()}}. > Existing tests that refer to the {{initCoreDataDir}} directly (either before, > or during the lifecycle of an active core managed via {{initCore()}} / > {{deleteCore()}} ) may encounter {{NullPointerExceptions}} on upgrading to > Solr 8.3 as a result of this bug fix. These tests are encouraged to use the > new helper method {{initAndGetDataDir()}} in place of referring to > the (now deprecated) {{initCoreDataDir}} variable directly. > Any existing tests that refer to the {{initCoreDataDir}} directly *after* > calling {{deleteCore()}} with the intention of inspecting the index
[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
[ https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13664: Status: Patch Available (was: Open) > SolrTestCaseJ4.deleteCore() does not delete/clean dataDir > -- > > Key: SOLR-13664 > URL: https://issues.apache.org/jira/browse/SOLR-13664 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch, > SOLR-13664.patch > > > In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does > nothing to delete the dataDir used by the TestHarness > The git history is a bit murky, so i'm not entirely certain when this stopped > working, but I suspect it happened as part of the overall cleanup regarding > test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> > TestRuleTemporaryFilesCleanup}} > While this is not problematic in many test classes, where a single > {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses > that SolrCore for all test methods and relies on {{@AfterClass > SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's > problematic in test classes where {{deleteCore()}} is explicitly called in an > {{@After}} method to ensure a unique core (w/unique dataDir) is used for each > test method. > (there are currently about 61 tests that call {{deleteCore()}} directly) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
[ https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13664: Attachment: SOLR-13664.patch Status: Open (was: Open) Updated patch to: * update all remaining tests that still referred to (the now deprecated) {{initCoreDataDir}} to either use {{initAndGetDataDir()}} or just use {{createTempDir()}} when their usage never had any reason to re-use the {{initCore()}} dataDir anyway * fix a few precommit issues (unused imports). I'm still testing, but i think this is ready... > SolrTestCaseJ4.deleteCore() does not delete/clean dataDir > -- > > Key: SOLR-13664 > URL: https://issues.apache.org/jira/browse/SOLR-13664 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch, > SOLR-13664.patch > > > In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does > nothing to delete the dataDir used by the TestHarness > The git history is a bit murky, so i'm not entirely certain when this stopped > working, but I suspect it happened as part of the overall cleanup regarding > test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> > TestRuleTemporaryFilesCleanup}} > While this is not problematic in many test classes, where a single > {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses > that SolrCore for all test methods and relies on {{@AfterClass > SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's > problematic in test classes where {{deleteCore()}} is explicitly called in an > {{@After}} method to ensure a unique core (w/unique dataDir) is used for each > test method. > (there are currently about 61 tests that call {{deleteCore()}} directly) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
[ https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897525#comment-16897525 ] Hoss Man edited comment on SOLR-13664 at 7/31/19 9:03 PM: -- Testing of the last patch uncovered 3 classes of problems in our existing tests: # tests trying to call {{FileUtils.deleteDirectory(initCoreDataDir);}} after calling {{deleteCore()}} (specifically to work around this bug!) that now get NPE #* Example: TestRecovery # tests that need to be able to write files to {{initCoreDataDir}} *before* calling {{initCore()}} _and_ call {{initCore()}} + {{deleteCore()}} in the individual test method life cycle #* the patch already ensures {{initCoreDataDir}} is created before the subclass is initialized, so tests that just used a single {{initCore()}} call for all test methods would be fine -- it's only tests that also {{deleteCore()}} in {{@After}} methods that are problems #* Example: QueryElevationComponentTest, SolrCoreCheckLockOnStartupTest # one test that doesn't even use {{initCore()}} -- it builds its own TestHarness/CoreContainer using {{initCoreDataDir}} directly -- but like #2, calls {{deleteCore()}} in {{@After}} methods (to leverage the common cleanup of the TestHarness) #* Example: SolrMetricsIntegrationTest Based on these classes of problems, I think the best way forward is to update the existing patch to: * make the {{initAndGetDataDir()}} private helper method I introduced in the last patch public and change it to return the {{File}} (not String) and beef up its javadocs * deprecate {{initCoreDataDir}} and change all existing direct uses of it in our tests to use the helper method This makes fixing the existing test problems trivial: just replace all uses of {{initCoreDataDir}} with {{initAndGetDataDir()}} ... any logic attempting to seed/inspect the dataDir prior to {{initCore()}} will initialize the directory that will be used by the next subsequent {{initCore()}} call. I've attached an updated patch with these changes, but! ... for completeness, I think it's important to also consider how this will impact any existing third-party tests downstream users may have written that subclass {{SolrTestCaseJ4}}: * if they don't refer to {{initCoreDataDir}} directly, then like the existing patch the only change in behavior they should notice is that if their tests call {{deleteCore()}} any subsequent {{initCore()}} calls won't be polluted with the old data. * if they do refer to {{initCoreDataDir}} directly in tests, then that usage _may_ continue to work as is if the usage is only "between" calls to {{initCore()}} and {{deleteCore()}} (ie: to inspect the data dir) * if they attempt to use {{initCoreDataDir}} _after_ calling {{deleteCore()}} (either directly, or indirectly by referencing it before a call to {{initCore()}} in a test lifecycle that involves multiple {{initCore()}} + {{deleteCore()}} pairs) then they will start getting NPEs and will need to change their test to use {{initAndGetDataDir()}} directly. I think the tradeoff of fixing this bug vs the impact on end users is worth making this change: right now the bug can silently affect users w/weird results, but any tests that are impacted adversely by this change will trigger loud NPEs and have an easy fix we can mention in the upgrade notes. 
was (Author: hossman): Testing of the last patch uncovered 3 classes of problems in our existing tests: # tests trying to call {{FileUtils.deleteDirectory(initCoreDataDir);}} after calling {{deleteCore()}} (specifically to work around this bug!) that now get NPE #* Example: TestRecovery # tests that need to be able to write files to {{initCoreDataDir}} *before* calling {{initCore()}} _and_ call {{initCore()}} + {{deleteCore()}} in the individual test method life cycle #* the patch already ensures {{initCoreDataDir}} is created before the subclass is initialized, so tests that just used a single {{initCore()}} call for all test methods would be fine -- it's only tests that also {{deleteCore()}} in {{@After}} methods that are problems #* Example: QueryElevationComponentTest, SolrCoreCheckLockOnStartupTest * one test that doesn't even use {{initCore()}} -- it builds its own TestHarness/CoreContainer using {{initCoreDataDir}} directly -- but like #2, calls {{deleteCore()}} in {{@After}} methods (to leverage the common cleanup of the TestHarness) ** Example: SolrMetricsIntegrationTest Based on these classes of problems, I think the best way forward is to update the existing patch to: * make the {{initAndGetDataDir()}} private helper method I introduced in the last patch public and change it to return the {{File}} (not String) and beef up its javadocs * deprecate {{initCoreDataDir}} and change all existing direct uses of it in our tests to use the helper method This makes fixing the existing test problems tr
[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
[ https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13664: Attachment: SOLR-13664.patch Status: Open (was: Open) Testing of the last patch uncovered 3 classes of problems in our existing tests: # tests trying to call {{FileUtils.deleteDirectory(initCoreDataDir);}} after calling {{deleteCore()}} (specifically to work around this bug!) that now get NPE #* Example: TestRecovery # tests that need to be able to write files to {{initCoreDataDir}} *before* calling {{initCore()}} _and_ call {{initCore()}} + {{deleteCore()}} in the individual test method life cycle #* the patch already ensures {{initCoreDataDir}} is created before the subclass is initialized, so tests that just used a single {{initCore()}} call for all test methods would be fine -- it's only tests that also {{deleteCore()}} in {{@After}} methods that are problems #* Example: QueryElevationComponentTest, SolrCoreCheckLockOnStartupTest * one test that doesn't even use {{initCore()}} -- it builds its own TestHarness/CoreContainer using {{initCoreDataDir}} directly -- but like #2, calls {{deleteCore()}} in {{@After}} methods (to leverage the common cleanup of the TestHarness) ** Example: SolrMetricsIntegrationTest Based on these classes of problems, I think the best way forward is to update the existing patch to: * make the {{initAndGetDataDir()}} private helper method I introduced in the last patch public and change it to return the {{File}} (not String) and beef up its javadocs * deprecate {{initCoreDataDir}} and change all existing direct uses of it in our tests to use the helper method This makes fixing the existing test problems trivial: just replace all uses of {{initCoreDataDir}} with {{initAndGetDataDir()}} ... any logic attempting to seed/inspect the dataDir prior to {{initCore()}} will initialize the directory that will be used by the next subsequent {{initCore()}} call. I've attached an updated patch with these changes, but! ... for completeness, I think it's important to also consider how this will impact any existing third-party tests downstream users may have written that subclass {{SolrTestCaseJ4}}: * if they don't refer to {{initCoreDataDir}} directly, then like the existing patch the only change in behavior they should notice is that if their tests call {{deleteCore()}} any subsequent {{initCore()}} calls won't be polluted with the old data. * if they do refer to {{initCoreDataDir}} directly in tests, then that usage _may_ continue to work as is if the usage is only "between" calls to {{initCore()}} and {{deleteCore()}} (ie: to inspect the data dir) * if they attempt to use {{initCoreDataDir}} _after_ calling {{deleteCore()}} (either directly, or indirectly by referencing it before a call to {{initCore()}} in a test lifecycle that involves multiple {{initCore()}} + {{deleteCore()}} pairs) then they will start getting NPEs and will need to change their test to use {{initAndGetDataDir()}} directly. I think the tradeoff of fixing this bug vs the impact on end users is worth making this change: right now the bug can silently affect users w/weird results, but any tests that are impacted adversely by this change will trigger loud NPEs and have an easy fix we can mention in the upgrade notes. > SolrTestCaseJ4.deleteCore() does not delete/clean dataDir > -- > > Key: SOLR-13664 > URL: https://issues.apache.org/jira/browse/SOLR-13664 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. 
Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch > > > In spite of what it's javadocs say, {{SolrTestCaseJ4.deleteCore()}} does > nothing to delete the dataDir used by the TestHarness > The git history is a bit murky, so i'm not entirely certain when this stoped > working, but I suspect it happened as part of the overall cleanup regarding > test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> > TestRuleTemporaryFilesCleanup}} > While this is not problematic in many test classes, where a single > {{initCore(...) is called in a {{@BeforeClass}} and the test then re-uses > that SolrCore for all test methods and relies on {{@AfterClass > SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's > problematic in test classes where {{deleteCore()}} is explicitly called in an > {{@After}} method to ensure a unique core (w/unique dataDir) is used for each > test method. > (there are currently about 61 tests that call {{deleteCore()}} directly) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
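[Editorial note: a minimal sketch of what the public helper described above might look like inside SolrTestCaseJ4 -- the committed patch may differ in naming and details; only {{initAndGetDataDir()}}, {{initCoreDataDir}}, and {{createTempDir()}} come from the comment itself.]

{code:java}
// Sketch only -- the actual patch may differ. Lives in SolrTestCaseJ4.
// Lazily (re)creates the per-test dataDir, so callers that run before
// initCore() or after deleteCore() both get a valid, empty directory.
public static File initAndGetDataDir() {
  File dataDir = initCoreDataDir;
  if (null == dataDir) {
    // createTempDir() is the LuceneTestCase helper already used for temp dirs;
    // TestRuleTemporaryFilesCleanup still deletes it when the test finishes.
    dataDir = initCoreDataDir = createTempDir("data-dir").toFile();
  }
  return dataDir;
}
{code}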
[jira] [Commented] (SOLR-13579) Create resource management API
[ https://issues.apache.org/jira/browse/SOLR-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897347#comment-16897347 ] Hoss Man commented on SOLR-13579: -

bq. We could perhaps call a type-safe and name-safe component API from a generic management API by following a similar convention as the one used in SolrPluginUtils.invokeSetters? Or use marker interfaces that also provide validation / conversion. I'll look into this.

Unless there's something I'm missing (and that's incredibly likely) I don't even think you'd need a SolrPluginUtils.invokeSetters type hack for any of this -- except maybe mapping REST commands in the ResourceManagerHandler to methods in the ResourceManagerPlugins?

What I was imagining was a more straightforward subclass/subinterface relationship, using generics to tightly couple the ManagedComponent impls to the corresponding ResourceManagerPlugins -- so the plugins could have a completely statically typed API for calling methods on the Components. ala...

{code}
public interface ManagedComponent {
  ManagedComponentId getManagedComponentId();
  ...
}
public abstract class ResourceManagerPlugin {
  /** if needed by ResourceManagerHandler or metrics */
  public abstract void setResourceLimits(ManagedComponentId component, Map limits);
  /** if needed by ResourceManagerHandler or metrics */
  public abstract Map getResourceLimits(ManagedComponentId component);
  ...
  // other general API methods needed for linking/registering type "T" components
  // (or Pool) and for "managing" all of them...
  ...
}
public interface ManagedCacheComponent extends ManagedComponent {
  // actual caches implement this, and only have to worry about type specific methods
  // for managing their resource related settings -- nothing about the REST API...
  public void setMaxSize(long size);
  public void setMaxRamMB(int maxRamMB);
  public long getMaxSize();
  public int getMaxRamMB();
}
public class CacheManagerPlugin extends ResourceManagerPlugin {
  // concrete impls like this can use the statically typed get/set methods of the concrete
  // ManagedComponent impls in their getResourceLimits/setResourceLimits & manage methods
  ...
}
{code}

> Create resource management API
> --
>
> Key: SOLR-13579
> URL: https://issues.apache.org/jira/browse/SOLR-13579
> Project: Solr
> Issue Type: New Feature
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Priority: Major
> Attachments: SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch
>
> Resource management framework API supporting the goals outlined in SOLR-13578.

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
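[Editorial note: the generics coupling the comment above alludes to could be spelled out as follows -- a hypothetical refinement building on the interfaces sketched in that comment, not code from any attached patch; the "maxSize"/"maxRamMB" keys are illustrative.]

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical: bind each plugin to the component type it manages, so
// concrete plugins never need casts or reflection-based setter hacks.
public abstract class ResourceManagerPlugin<T extends ManagedComponent> {
  public abstract void setResourceLimits(T component, Map<String, Object> limits);
  public abstract Map<String, Object> getResourceLimits(T component);
}

class CacheManagerPlugin extends ResourceManagerPlugin<ManagedCacheComponent> {
  @Override
  public void setResourceLimits(ManagedCacheComponent cache, Map<String, Object> limits) {
    Object maxSize = limits.get("maxSize"); // key name is illustrative
    if (maxSize instanceof Number) {
      cache.setMaxSize(((Number) maxSize).longValue()); // statically typed call, no reflection
    }
  }
  @Override
  public Map<String, Object> getResourceLimits(ManagedCacheComponent cache) {
    Map<String, Object> limits = new HashMap<>();
    limits.put("maxSize", cache.getMaxSize());
    limits.put("maxRamMB", cache.getMaxRamMB());
    return limits;
  }
}
{code}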
[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
[ https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13664: Attachment: SOLR-13664.patch Status: Open (was: Open)

Here's an updated patch with a fix that I _think_ is good, but I'm still in the process of testing, and I want to spend some more time thinking through possible ramifications on third party subclasses.

The basic idea is that {{deleteCore()}} now nulls out the {{initCoreDataDir}} variable -- w/o doing any actual IO deletion. We still trust/rely on {{TestRuleTemporaryFilesCleanup}} to do its job of deleting these temp dirs if the test succeeds. Any place in {{SolrTestCaseJ4}} that currently depends on {{initCoreDataDir}} being set now uses a private helper method to ensure it's initialized.

> SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
> --
>
> Key: SOLR-13664
> URL: https://issues.apache.org/jira/browse/SOLR-13664
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Hoss Man
> Priority: Major
> Attachments: SOLR-13664.patch, SOLR-13664.patch
>
> In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does nothing to delete the dataDir used by the TestHarness
> The git history is a bit murky, so I'm not entirely certain when this stopped working, but I suspect it happened as part of the overall cleanup regarding test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> TestRuleTemporaryFilesCleanup}}
> While this is not problematic in many test classes, where a single {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses that SolrCore for all test methods and relies on {{@AfterClass SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's problematic in test classes where {{deleteCore()}} is explicitly called in an {{@After}} method to ensure a unique core (w/unique dataDir) is used for each test method.
> (there are currently about 61 tests that call {{deleteCore()}} directly)

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
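[Editorial note: a rough illustration of the "null it out, no IO" idea described above -- not the literal patch; the {{h}} TestHarness field and its cleanup are assumptions about SolrTestCaseJ4 internals.]

{code:java}
// Sketch of the approach described above -- not the literal patch.
// deleteCore() just forgets the old dataDir; TestRuleTemporaryFilesCleanup
// still physically removes the temp dir when the test (class) finishes.
public static void deleteCore() {
  if (h != null) {
    h.close();              // 'h' is the TestHarness field in SolrTestCaseJ4
  }
  h = null;
  initCoreDataDir = null;   // next initCore() gets a fresh, empty dataDir
}
{code}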
[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
[ https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13664: Attachment: SOLR-13664.patch Status: Open (was: Open)

The attached patch doesn't fix the problem -- still thinking about the best solution to move forward -- but it does trivially demonstrate this problem in a new test. It also updates {{TestUseDocValuesAsStored}} to include a sanity check against this problem. A weird {{TestUseDocValuesAsStored}} jenkins failure is how I discovered this in the first place... apache_Lucene-Solr-Tests-8.2_34.log.txt

{noformat}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestUseDocValuesAsStored -Dtests.method=testDuplicateMultiValued -Dtests.seed=69AC8730651B9CCD -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=ja -Dtests.timezone=America/Argentina/ComodRivadavia -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 1.13s J0 | TestUseDocValuesAsStored.testDuplicateMultiValued <<<
[junit4]> Throwable #1: java.lang.RuntimeException: Exception during query
[junit4]>at __randomizedtesting.SeedInfo.seed([69AC8730651B9CCD:87719310ABAA6A71]:0)
[junit4]>at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:947)
[junit4]>at org.apache.solr.schema.TestUseDocValuesAsStored.doTest(TestUseDocValuesAsStored.java:367)
[junit4]>at org.apache.solr.schema.TestUseDocValuesAsStored.testDuplicateMultiValued(TestUseDocValuesAsStored.java:172)
[junit4]>at java.lang.Thread.run(Thread.java:748)
[junit4]> Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//arr[@name='enums_dvo']/str[.='Not Available']
[junit4]>xml response was:
[junit4]>
[junit4]> 00xyzmyid1XY2XXY3XY4-66642425-66.6664.24.26-6664204207-6.E-50.00420.004281999-12-31T23:59:59Z2016-07-04T03:02:01Z2016-07-04T03:02:01Z
[junit4]>
[junit4]>request was:q=*:*&fl=*&wt=xml
[junit4]>at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:940)
{noformat}

...what's happening here is that docs from previous test methods in this class (that should have been using their own distinct cores + dataDirs) are bleeding into this test, causing the doc the test is checking for to be pushed out past the {{rows=10}} results. (note the {{numFound="11"}})

> SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
> --
>
> Key: SOLR-13664
> URL: https://issues.apache.org/jira/browse/SOLR-13664
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Hoss Man
> Priority: Major
> Attachments: SOLR-13664.patch
>
> In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does nothing to delete the dataDir used by the TestHarness
> The git history is a bit murky, so I'm not entirely certain when this stopped working, but I suspect it happened as part of the overall cleanup regarding test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> TestRuleTemporaryFilesCleanup}}
> While this is not problematic in many test classes, where a single {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses that SolrCore for all test methods and relies on {{@AfterClass SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's problematic in test classes where {{deleteCore()}} is explicitly called in an {{@After}} method to ensure a unique core (w/unique dataDir) is used for each test method.
> (there are currently about 61 tests that call {{deleteCore()}} directly)

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
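[Editorial note: the kind of per-method sanity check described above might look roughly like this, using standard SolrTestCaseJ4 helpers -- assumed sketch, not the committed test.]

{code:java}
// Assumed sketch: verify per-method core isolation -- a fresh core must start empty.
@Test
public void testDataDirIsClean() throws Exception {
  assertU(adoc("id", "1"));
  assertU(commit());
  deleteCore();                                  // should discard the dataDir...
  initCore("solrconfig.xml", "schema.xml");
  assertQ(req("q", "*:*"),
          "//result[@numFound='0']");            // ...so no docs bleed through
}
{code}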
[jira] [Created] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
Hoss Man created SOLR-13664: --- Summary: SolrTestCaseJ4.deleteCore() does not delete/clean dataDir Key: SOLR-13664 URL: https://issues.apache.org/jira/browse/SOLR-13664 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man Assignee: Hoss Man In spite of what it's javadocs say, {{SolrTestCaseJ4.deleteCore()}} does nothing to delete the dataDir used by the TestHarness The git history is a bit murky, so i'm not entirely certain when this stoped working, but I suspect it happened as part of the overall cleanup regarding test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> TestRuleTemporaryFilesCleanup}} While this is not problematic in many test classes, where a single {{initCore(...) is called in a {{@BeforeClass}} and the test then re-uses that SolrCore for all test methods and relies on {{@AfterClass SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's problematic in test classes where {{deleteCore()}} is explicitly called in an {{@After}} method to ensure a unique core (w/unique dataDir) is used for each test method. (there are currently about 61 tests that call {{deleteCore()}} directly) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13660) AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken
[ https://issues.apache.org/jira/browse/SOLR-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13660: Resolution: Fixed Fix Version/s: 8.3 master (9.0) Status: Resolved (was: Patch Available) > AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken > - > > Key: SOLR-13660 > URL: https://issues.apache.org/jira/browse/SOLR-13660 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Fix For: master (9.0), 8.3 > > Attachments: SOLR-13660.patch > > > {{AbstractFullDistribZkTestBase.waitForActiveReplicaCount(...)}} is broken, > and does not actually check that the replicas are active. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13660) AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken
[ https://issues.apache.org/jira/browse/SOLR-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13660: Attachment: SOLR-13660.patch Status: Open (was: Open)

Although this method is not used directly in many Solr tests that subclass {{AbstractFullDistribZkTestBase}}, it is used by other methods in {{AbstractFullDistribZkTestBase}} -- including when creating the {{DEFAULT_COLLECTION}}. Because of the esoteric way {{AbstractFullDistribZkTestBase}} initializes its collections (and jetty instances) almost every replica created starts in recovery -- so as a result of this bug, subclasses may frequently see their test methods being invoked before the expected number of shards/replicas are active. In at least one case (TestCloudSchemaless) this has led to test failures (ultimately due to requests timing out when trying to add documents) as a result of test client operations competing with multiple concurrent replica recoveries on CPU constrained jenkins machines.

The attached patch:
* fixes {{waitForActiveReplicaCount(...)}} to check that the replicas are active
* deprecates and updates the javadocs of {{getTotalReplicas(...)}} to make it clear that this method doesn't care about the status of the replica.
** this method was formerly used by {{waitForActiveReplicaCount(...)}}
* also makes some related fixes to {{createJettys(...)}}:
** adds some comments clarifying how this method initializes the shards vs adding the replicas
** improves the initial slice count check to use existing helper methods which also verify the slices are active
*** this doesn't really affect the correctness of the method given how the collection is used at this point, but helps simplify the code.

> AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken
> -
>
> Key: SOLR-13660
> URL: https://issues.apache.org/jira/browse/SOLR-13660
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Hoss Man
> Priority: Major
> Attachments: SOLR-13660.patch
>
> {{AbstractFullDistribZkTestBase.waitForActiveReplicaCount(...)}} is broken, and does not actually check that the replicas are active.

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
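[Editorial note: the essence of the fix described above is counting only replicas that are both ACTIVE and on a live node -- an assumed sketch using standard SolrJ cloud types, not the literal patch.]

{code:java}
import java.util.Set;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

// Assumed sketch of the corrected check: a replica only counts when its
// state is ACTIVE *and* the node hosting it is still live.
class ActiveReplicaCheck {
  static boolean hasActiveReplicaCount(DocCollection coll, Set<String> liveNodes, int expected) {
    int active = 0;
    for (Replica r : coll.getReplicas()) {
      if (r.getState() == Replica.State.ACTIVE && liveNodes.contains(r.getNodeName())) {
        active++;
      }
    }
    return active >= expected;
  }
}
{code}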
[jira] [Updated] (SOLR-13660) AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken
[ https://issues.apache.org/jira/browse/SOLR-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13660: Status: Patch Available (was: Open) > AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken > - > > Key: SOLR-13660 > URL: https://issues.apache.org/jira/browse/SOLR-13660 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Attachments: SOLR-13660.patch > > > {{AbstractFullDistribZkTestBase.waitForActiveReplicaCount(...)}} is broken, > and does not actually check that the replicas are active. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-13660) AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken
Hoss Man created SOLR-13660: --- Summary: AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken Key: SOLR-13660 URL: https://issues.apache.org/jira/browse/SOLR-13660 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man Assignee: Hoss Man {{AbstractFullDistribZkTestBase.waitForActiveReplicaCount(...)}} is broken, and does not actually check that the replicas are active. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades
[ https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-13599. - Resolution: Cannot Reproduce

Not a single jenkins failure in this test since backporting the logging additions to branch_8x on July 8. Doesn't seem like there is much more we can do here.

> ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades
>
>
> Key: SOLR-13599
> URL: https://issues.apache.org/jira/browse/SOLR-13599
> Project: Solr
> Issue Type: Bug
> Reporter: Hoss Man
> Priority: Major
> Attachments: thetaphi_Lucene-Solr-master-Windows_8025.log.txt
>
> We've started seeing some weirdly consistent (but not reliably reproducible) failures from ReplicationFactorTest when running on Uwe's Windows jenkins machines.
> The failures all seem to have started on June 22 -- when Uwe upgraded his Windows VMs to upgrade the Java version, but happen across all versions of java tested, and on both master and branch_8x.
> While this test failed a total of 5 times, in different ways, on various jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it fails the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins builds frequently fails anywhere from 1-4 additional times.
> All of these failures occur in the exact same place, with the exact same assertion: that the expected replicationFactor of 2 was not achieved, and an rf=1 (ie: only the master) was returned, when sending a _batch_ of documents to a collection with 1 shard, 3 replicas; while 1 of the replicas was partitioned off due to a closed proxy.
> In the handful of logs I've examined closely, the 2nd "live" replica does in fact log that it received & processed the update, but with a QTime of over 30 seconds, and then it immediately logs an {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- meanwhile, the leader has one {{updateExecutor}} thread logging copious amounts of {{java.net.ConnectException: Connection refused: no further information}} regarding the replica that was partitioned off, before a second {{updateExecutor}} thread ultimately logs {{java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" replica.
>
> What makes this perplexing is that this is not the first time in the test that documents were added to this collection while one replica was partitioned off, but it is the first time that all 3 of the following are true _at the same time_:
> # the collection has recovered after some replicas were partitioned and re-connected
> # a batch of multiple documents is being added
> # one replica has been "re" partitioned.
> ...prior to the point when this failure happens, only individual document adds were tested while replicas were partitioned. Batches of adds were only tested when all 3 replicas were "live" after the proxies were re-opened and the collection had fully recovered. The failure also comes from the first update to happen after a replica's proxy port has been "closed" for the _second_ time.
> While this conflagration of events might conceivably trigger some weird bug, what makes these failures _particularly_ perplexing is that:
> * the failures only happen on Windows
> * the failures only started after the Windows VM update on June-22.
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
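[Editorial note: for readers unfamiliar with the rf assertion discussed above -- the achieved replication factor comes back in the update response header, roughly as sketched below; the collection name and document values are illustrative, not from the test.]

{code:java}
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

// Sketch: how a client reads the achieved replication factor ("rf") that
// ReplicationFactorTest asserts on; rf=1 means only the leader ack'd the update.
UpdateRequest ureq = new UpdateRequest();
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");
ureq.add(doc);
UpdateResponse rsp = ureq.process(cloudClient, "collection1");
Object rf = rsp.getResponseHeader().get("rf");
{code}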
[jira] [Commented] (SOLR-13579) Create resource management API
[ https://issues.apache.org/jira/browse/SOLR-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894184#comment-16894184 ] Hoss Man commented on SOLR-13579: -

Honestly, I'm still very lost. Part of my struggle is I'm trying to wade into the patch, and review the APIs and functionality it contains, while knowing -- as you mentioned -- that not all the details are here, and it's not fully fleshed out w/everything you intend as far as configuration and customization and having more concrete implementations beyond just the {{CacheManagerPlugin}}.

I know that in your mind there is more that can/should be done, and that some of this code is just "placeholder" for later, but I don't have enough familiarity with the "long term" plan to really understand what in the current patch is placeholder or stub APIs, vs what is "real" and exists because of long term visions for how all of these pieces can be used together in a more generalized system -- ie: what classes might have surface APIs that look more complex than needed given what's currently implemented in the patch, because of how you envision those classes being used in the future?

To pick one example: my question about the "ResourceManagerPool" vs "ResourceManagerPlugin" -- in your reply you said...
{quote}The code in ResourceManagerPool is independent of the type of resource(s) that a pool can manage. ...
{quote}
...but the code in {{ResourceManagerPlugin}} is _also_ independent of any specific type of resource(s) that a pool can manage -- those specifics only exist in the concrete subclasses. Hence the crux of my question is why these two very generalized pieces of abstract functionality/data collection couldn't just be a single abstract base class for all (concrete) ResourceManagerPlugin subclasses to extend? Your followup gives a clue...
{quote}...perhaps at some point we could allow a single pool to manage several aspects of a component, in which case a pool could have several plugins.
{quote}
but w/o some "concrete hypothetical" examples of what that might look like, it's hard to evaluate if the current APIs are the "best" approach, or if maybe there is something better/simpler.
{quote}Also, there can be different pools of the same type, each used for a different group of components that support the same management aspect. For example, for searcher caches we may want to eventually create separate pools for filterCache, queryResultCache and fieldValueCache. All of these pools would use the same plugin implementation CacheManagerPlugin but configured with different params and limits.
{quote}
But even in this situation, there could be multiple *instances* of a {{CacheManagerPlugin}}, one for each pool, each with different params and limits, w/o needing a distinction between the {{ResourceManagerPlugin}} concept/instances and the {{ResourceManagerPool}} concept/instances.

(To be clear, I'm not trying to harp on the specific design/separation/linkage of {{ResourceManagerPlugin}} vs {{ResourceManagerPool}} -- these are just some of the first classes I looked at and had questions about. I'm just using them as examples of where/how it's hard to ask questions or form opinions about the current API/code w/o having a better grasp of some "concrete specifics" (or even "hypothetical specifics") of when/how/where/why each of these APIs are expected to be used and interact w/each other.)

Another example of where I got lost as to the specific motivation behind some of these APIs in the long term view is in the "loose coupling" that currently exists in the patch between the {{ManagedComponent}} API and {{ResourceManagerPlugin}}. As I understand it:
* An object in Solr supports being managed by a particular subclass of {{ResourceManagerPlugin}} if and only if it extends {{ManagedComponent}} and implements {{ManagedComponent.getManagedResourceTypes()}} such that the resulting {{Collection}} contains a String matching the return value of {{ResourceManagerPlugin.getType()}} for that particular {{ResourceManagerPlugin}}
** ie: {{SolrCache}} extends the {{ManagedComponent}} interface, and all classes implementing {{SolrCache}} should/must implement {{getManagedResourceTypes()}} by returning a java {{Collection}} containing {{CacheManagerPlugin.TYPE}}
* once some {{ManagedComponent}} instances are "registered in a pool" and managed by a specific {{ResourceManagerPlugin}} instance, then that plugin expects to be able to call {{ManagedComponent.setResourceLimits(Map limits)}} and {{ManagedComponent.getResourceLimits()}} on all of those {{ManagedComponent}} instances, and that both Maps should contain/support a set of {{String}} keys specific to that {{ResourceManagerPlugin}} subclass according to {{ResourceManagerPlugin.getControlledParams()}}
** ie: {{CacheManagerPlugin.getControlledParams()}} returns a java {{Collection}} conta
[jira] [Created] (SOLR-13654) Scary jenkins failure related to collection creation: "non legacy mode coreNodeName missing"
Hoss Man created SOLR-13654: --- Summary: Scary jenkins failure related to collection creation: "non legacy mode coreNodeName missing" Key: SOLR-13654 URL: https://issues.apache.org/jira/browse/SOLR-13654 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man Attachments: thetaphi_Lucene-Solr-8.2-Linux_452.log.txt

A recent SplitShardTest jenkins failure has a perplexing error that I've been unable to reproduce...

{noformat}
[junit4]> Throwable #1: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:36447/_bx/t: Underlying core creation failed while creating collection: shardSplitWithRule_link
{noformat}

...this exception is thrown when attempting to create a brand new 1x2 collection (prior to any splitting) using the following rule/request...

{noformat}
CollectionAdminRequest.Create createRequest = CollectionAdminRequest.createCollection(collectionName, "conf1", 1, 2)
    .setRule("shard:*,replica:<2,node:*");
{noformat}

...the logs indicate that the specific problem is that the CREATE SolrCore commands aren't including a 'coreNodeName', which is mandatory because this is a "non legacy" cluster...

{noformat}
[junit4] 2> 1090551 ERROR (OverseerThreadFactory-6577-thread-5) [ ] o.a.s.c.a.c.OverseerCollectionMessageHandler Error from shard: http://127.0.0.1:36447/_bx/t
[junit4] 2> => org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:36447/_bx/t: Error CREATEing SolrCore 'shardSplitWithRule_link_shard1_replica_n1': non legacy mode coreNodeName missing {collection.configName=conf1, numShards=1, shard=shard1, collection=shardSplitWithRule_link, replicaType=NRT}
[junit4] 2>at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:656)
[junit4] 2> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:36447/_bx/t: Error CREATEing SolrCore 'shardSplitWithRule_link_shard1_replica_n1': non legacy mode coreNodeName missing {collection.configName=conf1, numShards=1, shard=shard1, collection=shardSplitWithRule_link, replicaType=NRT}
[junit4] 2>at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:656) ~[java/:?]
[junit4] 2>at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262) ~[java/:?]
[junit4] 2>at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:245) ~[java/:?]
[junit4] 2>at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1274) ~[java/:?]
[junit4] 2>at org.apache.solr.handler.component.HttpShardHandlerFactory$1.request(HttpShardHandlerFactory.java:176) ~[java/:?]
[junit4] 2>at org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:199) ~[java/:?]
[junit4] 2>at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
{noformat}

...so how/why is the Overseer generating CREATE core commands w/o coreNodeName params? Is this a race condition between the test setting legacyCloud=false and the Overseer processing the CREATE collection Op?

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
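[Editorial note: if the race hypothesis above is right, the property write the test performs looks something like the following sketch -- not the test's literal code; the client variable name is assumed.]

{code:java}
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

// Sketch: flipping the cluster into "non legacy" mode before creating the
// collection. If the Overseer reads the old value first, the CREATE core
// commands it issues can go out without the (now mandatory) coreNodeName.
CollectionAdminRequest.ClusterProp prop =
    CollectionAdminRequest.setClusterProperty("legacyCloud", "false");
prop.process(cloudClient);
{code}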
[jira] [Resolved] (SOLR-13653) java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches
[ https://issues.apache.org/jira/browse/SOLR-13653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-13653. - Resolution: Information Provided Assignee: Hoss Man

> java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches
> ---
>
> Key: SOLR-13653
> URL: https://issues.apache.org/jira/browse/SOLR-13653
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Hoss Man
> Priority: Major
> Labels: Java10, java11
>
> we've seen some java11 jenkins builds that have failed due to an AssertionError being thrown by HashMap.put as used in LRUCache -- in at least one case these failures are semi-reproducible. (the occasional "success" is likely due to some unpredictability in thread contention)
> Some cursory investigation suggests the culprit is JDK-8205399, first identified in java10, and fixed in java12-b26... https://bugs.openjdk.java.net/browse/JDK-8205399
> There does not appear to be anything we can do to mitigate this problem in Solr.
> It's also not clear to me, based on the comments in JDK-8205399, if the underlying problem can cause problems for end users running w/assertions disabled, or if it just results in sub-optimal performance

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13653) java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches
[ https://issues.apache.org/jira/browse/SOLR-13653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892983#comment-16892983 ] Hoss Man commented on SOLR-13653: -

sample failure from jenkins... http://fucit.org/solr-jenkins-reports/job-data/apache/Lucene-Solr-Tests-master/3453

{noformat}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=B1CFC66C4378F63 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=rn -Dtests.timezone=Africa/Kampala -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
[junit4] ERROR 18.5s J1 | TestCloudJSONFacetSKG.testRandom <<<
[junit4]> Throwable #1: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:42227/solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection: Expected mime type application/octet-stream but got text/html.
[junit4]> Error 500 Server Error
[junit4]> HTTP ERROR 500
[junit4]> Problem accessing /solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection/select. Reason:
[junit4]> Server ErrorCaused by:java.lang.AssertionError
[junit4]>at java.base/java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1896)
[junit4]>at java.base/java.util.HashMap$TreeNode.putTreeVal(HashMap.java:2061)
[junit4]>at java.base/java.util.HashMap.putVal(HashMap.java:633)
[junit4]>at java.base/java.util.HashMap.put(HashMap.java:607)
[junit4]>at org.apache.solr.search.LRUCache.put(LRUCache.java:201)
[junit4]>at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1449)
[junit4]>at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:568)
[junit4]>at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1484)
[junit4]>at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:398)
[junit4]>at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:305)
[junit4]>at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
[junit4]>at org.apache.solr.core.SolrCore.execute(SolrCore.java:2581)
{noformat}

As of dc8e9afff92f3ffc4081a2ecad5970eb09924a73 this seed reproduces fairly reliably for me using...

{noformat}
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment 18.9 (build 11.0.3+7)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.3+7, mixed mode)
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test -Dtests.dups=10 -Dtests.failfast=no -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=B1CFC66C4378F63 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=rn -Dtests.timezone=Africa/Kampala -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
...
[junit4] Tests with failures [seed: B1CFC66C4378F63]:
[junit4] - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
[junit4] - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
[junit4] - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
[junit4] - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
[junit4] - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
[junit4] - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
[junit4] - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
[junit4] - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
[junit4]
[junit4]
[junit4] JVM J0: 1.38 .. 126.31 = 124.93s
[junit4] JVM J1: 1.39 .. 128.17 = 126.77s
[junit4] JVM J2: 1.35 .. 123.50 = 122.15s
[junit4] Execution time total: 2 minutes 8 seconds
[junit4] Tests summary: 10 suites, 10 tests, 8 errors
BUILD FAILED
{noformat}

> java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches
> ---
>
> Key: SOLR-13653
> URL: https://issues.apache.org/jira/browse/SOLR-13653
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Labels: Java10, java11
>
> we've seen some java11 jenkins builds that have failed due to an AssertionError being thrown by HashMap.put as used in LRUCache -- in at least one case these failures are semi-reproducible. (the occasional "success" is likely due to some unpredictability in thread contention)
> Some curso
[jira] [Created] (SOLR-13653) java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches
Hoss Man created SOLR-13653: --- Summary: java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches Key: SOLR-13653 URL: https://issues.apache.org/jira/browse/SOLR-13653 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man

we've seen some java11 jenkins builds that have failed due to an AssertionError being thrown by HashMap.put as used in LRUCache -- in at least one case these failures are semi-reproducible. (the occasional "success" is likely due to some unpredictability in thread contention)

Some cursory investigation suggests the culprit is JDK-8205399, first identified in java10, and fixed in java12-b26... https://bugs.openjdk.java.net/browse/JDK-8205399

There does not appear to be anything we can do to mitigate this problem in Solr. It's also not clear to me, based on the comments in JDK-8205399, if the underlying problem can cause problems for end users running w/assertions disabled, or if it just results in sub-optimal performance

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13399) compositeId support for shard splitting
[ https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892894#comment-16892894 ] Hoss Man commented on SOLR-13399: -

bq. (unless you mean we've generally moved to doing doc it as part of the initial commit? If so, I missed that.)

yes, that's the entire value add of keeping the ref-guide in the same repo as the source, and having it as part of the main build w/precommit. We've been trying to move to having the "code release process" and the "ref-guide release process" be a single process, with a single vote -- and we're getting close -- but the main hold up is people who add features w/o docs, forcing a scramble during the release process to back-fill docs on new features.

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
> Issue Type: New Feature
> Reporter: Yonik Seeley
> Assignee: Yonik Seeley
> Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
> Shard splitting does not currently have a way to automatically take into account the actual distribution (number of documents) in each hash bucket created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* command that would look at the number of docs sharing each compositeId prefix and use that to create roughly equal sized buckets by document count rather than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash buckets unless necessary (since that leads to larger query fanout). Perhaps this warrants a parameter that would control how much of a size mismatch is tolerable before resorting to splitting within a bucket. *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index the prefix in a different field. Iterating over the terms for this field would quickly give us the number of docs in each (i.e. lucene keeps track of the doc count for each term already.) Perhaps the implementation could be a flag on the *id* field... something like *indexPrefixes* and poly-fields that would cause the indexing to be automatically done and alleviate having to pass in an additional field during indexing and during the call to *SPLITSHARD*. This whole part is an optimization though and could be split off into its own issue if desired.
>
>

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
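[Editorial note: for context, the proposed parameter from the issue description would be passed on the SPLITSHARD call along these lines -- the host, collection, and shard names are illustrative.]

{noformat}
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycoll&shard=shard1&splitByPrefix=true
{noformat}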
[jira] [Commented] (SOLR-13399) compositeId support for shard splitting
[ https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892304#comment-16892304 ] Hoss Man commented on SOLR-13399: -

Also: it's really not cool to be adding new end-user features/params w/o at least adding a one-line summary of the new param to the relevant ref-guide page.

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
> Issue Type: New Feature
> Reporter: Yonik Seeley
> Assignee: Yonik Seeley
> Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
> Shard splitting does not currently have a way to automatically take into account the actual distribution (number of documents) in each hash bucket created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* command that would look at the number of docs sharing each compositeId prefix and use that to create roughly equal sized buckets by document count rather than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash buckets unless necessary (since that leads to larger query fanout). Perhaps this warrants a parameter that would control how much of a size mismatch is tolerable before resorting to splitting within a bucket. *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index the prefix in a different field. Iterating over the terms for this field would quickly give us the number of docs in each (i.e. lucene keeps track of the doc count for each term already.) Perhaps the implementation could be a flag on the *id* field... something like *indexPrefixes* and poly-fields that would cause the indexing to be automatically done and alleviate having to pass in an additional field during indexing and during the call to *SPLITSHARD*. This whole part is an optimization though and could be split off into its own issue if desired.
>
>

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (SOLR-13399) compositeId support for shard splitting
[ https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man reopened SOLR-13399: -

Since the new SplitByPrefixTest was committed as part of this jira, it has failed a little over 5% of the time it's been run by jenkins -- on both master and branch_8x. All of these failures occur at the same {{assertTrue(slice1 != slice2)}} call (SplitByPrefixTest.java:222) and all of the seeds I've tested appear to reproduce reliably...

on master...
{noformat}
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test -Dtests.dups=10 -Dtests.failfast=no -Dtestcase=SplitByPrefixTest -Dtests.method=doTest -Dtests.seed=4A09C6784BF1B28F -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=ar-YE -Dtests.timezone=MET -Dtests.asserts=true -Dtests.file.encoding=UTF-8
...
[junit4] Tests summary: 10 suites, 10 tests, 10 failures
...
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test -Dtests.dups=10 -Dtests.failfast=no -Dtestcase=SplitByPrefixTest -Dtests.method=doTest -Dtests.seed=75D9C45CAC5D0D22 -Dtests.slow=true -Dtests.locale=yo-BJ -Dtests.timezone=Africa/Porto-Novo -Dtests.asserts=true -Dtests.file.encoding=UTF-8
...
[junit4] Tests summary: 10 suites, 10 tests, 10 failures
...
{noformat}

On branch_8x...
{noformat}
hossman@tray:~/lucene/dev/solr/core [j8] [branch_8x] $ ant test -Dtests.dups=10 -Dtests.failfast=no -Dtestcase=SplitByPrefixTest -Dtests.method=doTest -Dtests.seed=B980178A30F46BB3 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=ko-KR -Dtests.timezone=Africa/Abidjan -Dtests.asserts=true -Dtests.file.encoding=UTF-8
...
[junit4] Tests summary: 10 suites, 10 tests, 10 failures
...
{noformat}

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
> Issue Type: New Feature
> Reporter: Yonik Seeley
> Assignee: Yonik Seeley
> Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
> Shard splitting does not currently have a way to automatically take into account the actual distribution (number of documents) in each hash bucket created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* command that would look at the number of docs sharing each compositeId prefix and use that to create roughly equal sized buckets by document count rather than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash buckets unless necessary (since that leads to larger query fanout). Perhaps this warrants a parameter that would control how much of a size mismatch is tolerable before resorting to splitting within a bucket. *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index the prefix in a different field. Iterating over the terms for this field would quickly give us the number of docs in each (i.e. lucene keeps track of the doc count for each term already.) Perhaps the implementation could be a flag on the *id* field... something like *indexPrefixes* and poly-fields that would cause the indexing to be automatically done and alleviate having to pass in an additional field during indexing and during the call to *SPLITSHARD*. This whole part is an optimization though and could be split off into its own issue if desired.
> > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13637) Enable loading of plugins from the corecontainer memclassloader
[ https://issues.apache.org/jira/browse/SOLR-13637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892277#comment-16892277 ] Hoss Man commented on SOLR-13637: -

git bisect has identified 631edee1cba00d7fa41ac6e8c597a467db56346d as the cause of a recent spike in reproducible BasicAuthIntegrationTest jenkins failures on master. (Similar failures have been observed on branch_8x as well, but I have not bisected those.)

FWIW: Mikhail did some initial investigation into these failures in SOLR-13545, due to initial speculation that that issue caused the failures, and noted that they seemed to correspond to the randomization of using V2 API calls.

nature of failures...
{noformat}
[junit4] 2> 51548 ERROR (qtp206015367-815) [n:127.0.0.1:43556_solr ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: No contentStream
[junit4] 2>at org.apache.solr.handler.admin.SecurityConfHandler.doEdit(SecurityConfHandler.java:103)
[junit4] 2>at org.apache.solr.handler.admin.SecurityConfHandler.handleRequestBody(SecurityConfHandler.java:85)
[junit4] 2>at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
[junit4] 2>at org.apache.solr.api.ApiBag$ReqHandlerToApi.call(ApiBag.java:247)
[junit4] 2>at org.apache.solr.api.V2HttpCall.handleAdmin(V2HttpCall.java:341)
[junit4] 2>at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:786)
[junit4] 2>at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:546)
...
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=BasicAuthIntegrationTest -Dtests.method=testBasicAuth -Dtests.seed=B292FDDCA6F4D6F2 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=lb-LU -Dtests.timezone=Pacific/Easter -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[junit4] FAILURE 5.21s J2 | BasicAuthIntegrationTest.testBasicAuth <<<
[junit4]> Throwable #1: java.lang.AssertionError: expected:<401> but was:<400>
[junit4]>at __randomizedtesting.SeedInfo.seed([B292FDDCA6F4D6F2:EFC8BCE02A75588]:0)
[junit4]>at org.apache.solr.security.BasicAuthIntegrationTest.testBasicAuth(BasicAuthIntegrationTest.java:151)
{noformat}

...note the seed mentioned in the reproduce line.
That's an example of a seed that fails 100% reliably (on my machine) on master as of both HEAD and 631edee1cba00d7fa41ac6e8c597a467db56346d, but does not fail on the previous commit (7d716f11075f0868535c108b21256a3b91b4a154). There are dozens of other seeds from recent jenkins failures that reliably reproduce in the same way (NOTE: I did not bisect-test them all, but I did manually test a few of them against 631edee1cba00d7fa41ac6e8c597a467db56346d and 7d716f11075f0868535c108b21256a3b91b4a154).

Considering how many "addressing test failures" commits you've had to make as part of this issue just to address the TestContainerReqHandler failures it introduced, not to mention these BasicAuthIntegrationTest failures we've now identified, I would strongly urge you to please:
# *COMPLETELY* revert all commits made as part of this issue to date
# refrain from re-committing any related changes until you have a chance to beast all tests with multiple seeds, since this clearly impacts the entire code base in ways you evidently didn't anticipate
# once you have a unified set of changes w/working tests, re-commit only to master, and give it a few days to ensure no related failures, before backporting to branch_8x

> Enable loading of plugins from the corecontainer memclassloader
> ---
>
> Key: SOLR-13637
> URL: https://issues.apache.org/jira/browse/SOLR-13637
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Noble Paul
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When we update jars or add/modify plugins no core reloading should be required. Core reloading is a very expensive operation. Optionally, we can just make the plugin depend on the corecontainer level classloader.
> {code:xml}
> runtimeLib="global">
>
>
> {code}
> or alternately using the config API
> {code:json}
> curl -X POST -H 'Content-type:application/json' --data-binary '
> {
>   "create-queryparser": {
>     "name": "mycustomQParser" ,
>     "class" : "my.path.to.ClassName",
>     "runtimeLib" : "global"
>   }
> }' http://localhost:8983/api/c/mycollection/config
> {code}
> The global classloader is the corecontainer level classloader. So whenever this is reloaded the component gets reloaded. The only caveat is, this component cannot use core specific jars.
> We will deprecate the {{runtimeLib = true/false}} o
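[Editorial note: for anyone repeating the analysis above, the bisect run follows the standard git recipe; the good hash and the failing seed below are the ones quoted in the comment, and the one-liner mirrors the document's own reproduce line.]

{noformat}
git bisect start
git bisect bad                    # current master HEAD fails
git bisect good 7d716f11075f0868535c108b21256a3b91b4a154
# at each step, run the failing seed and report the result:
ant test -Dtestcase=BasicAuthIntegrationTest -Dtests.method=testBasicAuth -Dtests.seed=B292FDDCA6F4D6F2 && git bisect good || git bisect bad
git bisect reset                  # done: 631edee1cba00d7fa41ac6e8c597a467db56346d identified
{noformat}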
[jira] [Commented] (SOLR-13579) Create resource management API
[ https://issues.apache.org/jira/browse/SOLR-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892241#comment-16892241 ] Hoss Man commented on SOLR-13579: -

I spent some time briefly skimming the patch, and TBH got lost very quickly. I think it would be helpful (probably to more folks than just myself) if we could discuss, in "story" form, some (existing or hypothetical) examples of scenarios that could come up; how this new system would be helpful & behave in those scenarios, and what classes/objects (either in this patch, or yet to be written) would be responsible for each bit of action/reaction in those stories.

ie: I'm a solr cluster admin and I have some existing collections using the (existing) default cache configurations. When/why might I want to set up some pools? What types of steps would I take to do so? How would my configuration(s) change? After I have some pools in place, what's an example of something that might happen during runtime that would cause the ResourceManager to "do something" with my pools/caches? What would that "do something" look like in terms of method call stacks? What would the effective end result be from my perspective as an external observer?

Some specific bits that confuse me as I try to wrap my head around the current patch...
* If each named "pool" has exactly one ResourceManagerPlugin that contains the (type specific) actual logic for managing "the pool" (and the resources using that pool), then why is the "ResourceManagerPool" class different from the "ResourceManagerPlugin" class?
** as opposed to combining that logic into a single common base class?
** is there a one-to-many/many-to-one relationship between them that I'm not understanding?
* can you elaborate on this comment with some concrete examples:
{quote}Each managed resource can be managed by multiple types of plugins and it may appear in multiple pools (of different types). This reflects the fact that a single component may have multiple aspects of resource management - eg. cache mgmt, cpu, threads, etc.
{quote}
** ie: if "CacheManagerPlugin.TYPE" is one "type" of pool that a SolrCache (implements ManagedResource) might be managed by, what would another hypothetical "type" of plugin/pool be that SolrCache might also be a part of?
*** or if you can't think of a good example of two diff types that a SolrCache would be managed by, any example of a concept/object in solr that might become a "ManagedResource" that could be managed by two different types of plugins as part of 2 diff pools would be helpful
** What happens if a single ManagedResource is part of two different "pools" with two different ResourceManagerPlugins that give conflicting/overlapping instructions?
* regarding this comment...
{quote}Each pool also has plugin-specific parameters, most notably the limits - eg. max total cache size, which the CacheManagerPlugin knows how to use in order to adjust cache sizes.
{quote}
** does that imply that once SolrCache(s) are part of a "pool" they no longer have their own max size(s)? or is the configured max size of an individual cache(s) still a hard upper bound on the "managed size" that might be set at runtime as the plugins fire?
** how/where would someone specify a "preference" for ensuring that if a "pool" is "full" certain resources should be managed more aggressively than others -- ex: imagine a cluster admin wants all collections to have SolrCaches that are "as big as possible" given the resources of the machines, but wants to give priority to a certain subset of the "important" collections if resources get constrained; what/where would that be done? (A sketch of the single-base-class idea from the first bullet follows this message.)

Also, FYI: with this patch, we now have 2 "ManagedResource" classes in solr/core that have absolutely nothing to do with each other...
{noformat}
$ find -name ManagedResource.java
./solr/core/src/java/org/apache/solr/rest/ManagedResource.java
./solr/core/src/java/org/apache/solr/managed/ManagedResource.java
{noformat}
...that's a little weird.

> Create resource management API
> --
>
> Key: SOLR-13579
> URL: https://issues.apache.org/jira/browse/SOLR-13579
> Project: Solr
> Issue Type: New Feature
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Priority: Major
> Attachments: SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch
>
> Resource management framework API supporting the goals outlined in SOLR-13578.

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
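[Editorial note: concretely, the merge the comment above asks about might collapse to something like this -- a purely hypothetical API for discussion, not code from the patch; ManagedComponent stands in for the patch's managed-object interface.]

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Purely hypothetical: fold pool bookkeeping and plugin logic into one base
// class. Each concrete manager is instantiated once per pool, with that
// pool's own params and limits -- no separate Pool vs Plugin objects.
public abstract class ResourceManagerPool<T extends ManagedComponent> {
  protected final String poolName;
  protected final Map<String, Object> poolLimits;  // e.g. max total cache size
  protected final Map<String, T> components = new ConcurrentHashMap<>();

  protected ResourceManagerPool(String poolName, Map<String, Object> poolLimits) {
    this.poolName = poolName;
    this.poolLimits = poolLimits;
  }

  public void registerComponent(String id, T component) {
    components.put(id, component);
  }

  /** type-specific logic: adjust each component so the pool stays within its limits */
  public abstract void manage();
}
{code}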
[jira] [Commented] (SOLR-13375) 2 Dimensional Routed Aliases
[ https://issues.apache.org/jira/browse/SOLR-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888371#comment-16888371 ] Hoss Man commented on SOLR-13375: -

The new DimensionalRoutedAliasUpdateProcessorTest appears to have some reliably reproducible bugs. In the last 24 hours...
{quote}
Class: org.apache.solr.update.processor.DimensionalRoutedAliasUpdateProcessorTest
Method: testCatTime
Failures: 31.43% (11 / 35)
* thetaphi/Lucene-Solr-master-Windows/8059 (x6)
* apache/Lucene-Solr-repro/3446 (x5)
{quote}
The seeds from both of those jenkins jobs reproduce for me locally (first try)...
{noformat}
$ ant test -Dtestcase=DimensionalRoutedAliasUpdateProcessorTest -Dtests.method=testCatTime -Dtests.seed=DD5BB3E097BBD0B4 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=br -Dtests.timezone=Asia/Dushanbe -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
$ ant test -Dtestcase=DimensionalRoutedAliasUpdateProcessorTest -Dtests.method=testCatTime -Dtests.seed=21AAE082AEE603F0 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-8.x/test-data/enwiki.random.lines.txt -Dtests.locale=it -Dtests.timezone=Pacific/Port_Moresby -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
{noformat}
more failures from the past 7 days...
{quote}
Class: org.apache.solr.update.processor.DimensionalRoutedAliasUpdateProcessorTest
Method: testCatTime
Failures: 18.95% (29 / 153)
* apache/Lucene-Solr-repro/3446 (x5)
* sarowe/Lucene-Solr-reproduce-failed-tests/8776
* thetaphi/Lucene-Solr-8.x-Windows/371 (x6)
* thetaphi/Lucene-Solr-master-Windows/8059 (x6)
* sarowe/Lucene-Solr-tests-master/21358
* sarowe/Lucene-Solr-tests-master/21339
* sarowe/Lucene-Solr-tests-master/21355
* apache/Lucene-Solr-NightlyTests-8.x/153
* thetaphi/Lucene-Solr-8.x-Linux/879
* thetaphi/Lucene-Solr-8.x-Solaris/240 (x6)
{quote}

> 2 Dimensional Routed Aliases
>
>
> Key: SOLR-13375
> URL: https://issues.apache.org/jira/browse/SOLR-13375
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud
> Affects Versions: master (9.0)
> Reporter: Gus Heck
> Assignee: Gus Heck
> Priority: Major
> Attachments: SOLR-13375.patch, SOLR-13375.patch, SOLR-13375.patch, SOLR-13375.patch, SOLR-13375.patch
>
> Current available routed aliases are restricted to a single field. This feature will allow Solr to provide data driven collection access, creation and management based on multiple fields in a document. The collections will be queried and updated in a unified manner via an alias. Current routing is restricted to the values of a single field. The particularly useful combination at this time will be Category X Time routing but Category X Category may also be useful. More importantly, if additional routing schemes are created in the future (either as contributions or as custom code by users) combination among these should be supported.
> It is expected that not all combinations will be useful, and that determination of usefulness I expect to leave up to the user. Some Routing schemes may need to be limited to be the leaf/last routing scheme for technical reasons, though I'm not entirely convinced of that yet. If so, a flag will be added to the RoutedAlias interface.
> Initial desire is to support two levels, though if arbitrary levels can be supported easily that will be done.
> This could also have been called CompositeRoutedAlias, but that creates a TLA > clash with CategoryRoutedAlias. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8908) Specified default value not returned for query() when doc doesn't match
[ https://issues.apache.org/jira/browse/LUCENE-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887518#comment-16887518 ] Hoss Man commented on LUCENE-8908: -- [~munendrasn] at first glance this looks good ... but I'm wondering if this actually fixes all of the examples I mentioned when this was opened -- in particular things like {{exists(query($qx,0))}} vs {{exists(query($qx))}} ... those should return different things depending on whether the doc matches $qx or not. IIRC that would require modifying QueryDocValues.exists() to return "true" anytime there is a defVal, but I don't think that's really possible ATM because it's a {{float}} (not a nullable Float) ... and I'm not sure off the top of my head that it would even be the ideal behavior for the QueryDocValues code? ... maybe the solr ValueSourceParser logic should be changed to put an explicit wrapper around the QueryValueSource when a default is (isn't?) used ... not sure, I haven't looked / thought about this code in a long time. > Specified default value not returned for query() when doc doesn't match > --- > > Key: LUCENE-8908 > URL: https://issues.apache.org/jira/browse/LUCENE-8908 > Project: Lucene - Core > Issue Type: Bug >Reporter: Bill Bell >Priority: Major > Attachments: LUCENE-8908.patch, SOLR-7845.patch, SOLR-7845.patch > > > The 2 arg version of the "query()" was designed so that the second argument > would specify the value used for any document that does not match the query > specified by the first argument -- but the "exists" property of the resulting > ValueSource only takes into consideration whether or not the document matches > the query -- and ignores the use of the second argument. > > The work around is to ignore the 2 arg form of the query() function, and > instead wrap the query function in def(). > for example: {{def(query($something), $defaultval)}} instead of > {{query($something, $defaultval)}} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
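To make the "explicit wrapper" idea in the comment above concrete, here is a rough, untested sketch against the Lucene 8.x function-query API. The class name and the idea of wiring it in from Solr's ValueSourceParser are hypothetical, not the committed fix: the point is simply that once a default value exists, {{exists()}} can unconditionally report true, so {{exists(query($qx,0))}} and {{exists(query($qx))}} would finally differ.

{code:java}
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;

/** Hypothetical wrapper: supplying a default means every doc "exists". */
class DefaultingValueSource extends ValueSource {
  private final ValueSource wrapped; // e.g. a QueryValueSource without its own defVal
  private final float defVal;

  DefaultingValueSource(ValueSource wrapped, float defVal) {
    this.wrapped = wrapped;
    this.defVal = defVal;
  }

  @Override
  public FunctionValues getValues(Map context, LeafReaderContext readerContext) throws IOException {
    final FunctionValues inner = wrapped.getValues(context, readerContext);
    return new FloatDocValues(this) {
      @Override
      public float floatVal(int doc) throws IOException {
        // fall back to the default when the wrapped query doesn't match
        return inner.exists(doc) ? inner.floatVal(doc) : defVal;
      }
      @Override
      public boolean exists(int doc) {
        return true; // a default is always available, so exists() is always true
      }
    };
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof DefaultingValueSource
        && wrapped.equals(((DefaultingValueSource) o).wrapped)
        && Float.compare(defVal, ((DefaultingValueSource) o).defVal) == 0;
  }

  @Override
  public int hashCode() {
    return 31 * wrapped.hashCode() + Float.hashCode(defVal);
  }

  @Override
  public String description() {
    return "default(" + wrapped.description() + "," + defVal + ")";
  }
}
{code}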
[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886513#comment-16886513 ] Hoss Man commented on LUCENE-8920: -- [~sokolov] - your revert on branch_8_2 seems to have broken most of the lucene/analysis/kuromoji tests with a common root cause... {noformat} [junit4] ERROR 0.44s J0 | TestFactories.test <<< [junit4]> Throwable #1: java.lang.ExceptionInInitializerError [junit4]>at __randomizedtesting.SeedInfo.seed([B1B94D34D92CDA93:39ED72EE77D0B76B]:0) [junit4]>at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.getInstance(TokenInfoDictionary.java:62) [junit4]>at org.apache.lucene.analysis.ja.JapaneseTokenizer.(JapaneseTokenizer.java:215) [junit4]>at org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:150) [junit4]>at org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:82) [junit4]>at org.apache.lucene.analysis.ja.TestFactories$FactoryAnalyzer.createComponents(TestFactories.java:174) [junit4]>at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:199) [junit4]>at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkResetException(BaseTokenStreamTestCase.java:427) [junit4]>at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546) [junit4]>at org.apache.lucene.analysis.ja.TestFactories.doTestTokenizer(TestFactories.java:81) [junit4]>at org.apache.lucene.analysis.ja.TestFactories.test(TestFactories.java:60) [junit4]>at java.lang.Thread.run(Thread.java:748) [junit4]> Caused by: java.lang.RuntimeException: Cannot load TokenInfoDictionary. [junit4]>at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.(TokenInfoDictionary.java:71) [junit4]>... 46 more [junit4]> Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource org.apache.lucene.store.InputStreamDataInput@5f0dbb2f): 7 (needs to be between 6 and 6) [junit4]>at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:216) [junit4]>at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198) [junit4]>at org.apache.lucene.util.fst.FST.(FST.java:275) [junit4]>at org.apache.lucene.util.fst.FST.(FST.java:263) [junit4]>at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.(TokenInfoDictionary.java:47) [junit4]>at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.(TokenInfoDictionary.java:54) [junit4]>at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.(TokenInfoDictionary.java:32) [junit4]>at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.(TokenInfoDictionary.java:69) [junit4]>... 46 more {noformat} ...perhaps due to "conflicting reverts" w/ LUCENE-8907 / LUCENE-8778 ? /cc [~tomoko] > Reduce size of FSTs due to use of direct-addressing encoding > - > > Key: LUCENE-8920 > URL: https://issues.apache.org/jira/browse/LUCENE-8920 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mike Sokolov >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Some data can lead to worst-case ~4x RAM usage due to this optimization. > Several ideas were suggested to combat this on the mailing list: > bq. I think we can improve the situation here by tracking, per-FST instance, > the size increase we're seeing while building (or perhaps do a preliminary > pass before building) in order to decide whether to apply the encoding. > bq. we could also make the encoding a bit more efficient. 
For instance I > noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) > which makes gaps very costly. Associating each label with a dense id and > having an intermediate lookup, ie. lookup label -> id and then id->arc offset > instead of doing label->arc directly could save a lot of space in some cases? > Also it seems that we are repeating the label in the arc metadata when > array-with-gaps is used, even though it shouldn't be necessary since the > label is implicit from the address? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
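For reference, the {{IndexFormatTooNewException}} in the stack trace above is exactly what the FST header check produces when a serialized FST's version falls outside the supported range. A minimal sketch of the failing check (the "FST" codec name and the 6..6 range are read off the trace and FST.java conventions, not verified against branch_8_2; presumably the bundled kuromoji dictionary was regenerated with the newer version-7 format while the revert lowered the readable range back to 6):

{code:java}
import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataInput;

// Sketch of the check FST's constructor performs on its serialized header:
// if the stream says version 7 but the code only accepts 6, checkHeader
// throws IndexFormatTooNewException: "7 (needs to be between 6 and 6)".
int readFstVersion(DataInput in) throws IOException {
  return CodecUtil.checkHeader(in, "FST", /*minVersion=*/6, /*maxVersion=*/6);
}
{code}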
[jira] [Reopened] (SOLR-13534) Dynamic loading of jars from a url
[ https://issues.apache.org/jira/browse/SOLR-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man reopened SOLR-13534: - Noble: the new TestDynamicLoadingUrl can fail easily on heavily loaded machines because loading the "new" SolrCore instances (that result from reloading all cores after a config change) may not complete prior to the subsequent REST calls that depend on them. In many recent jenkins failures this even happens as a result of the first SolrCore reload needed when executing the {{'create-requesthandler'}} config command to register the {{/jarhandler}} used by the test to "fake" a remote server. Here's an example of the order of the "going to send config command" logging from the test compared to the "CLOSING SolrCore" logging (indicating that the _new_ reloaded version of the core is already live and ready for requests) in a recent jenkins failure... {noformat} $ sed -n -e '/o.a.s.c.TestSolrConfigHandler/,+1p' -e '/config update listener called/p' -e '/CLOSING SolrCore/p' thetaphi_Lucene-Solr-master-MacOSX_5259.log.txt [junit4] 2> 549681 INFO (TEST-TestDynamicLoadingUrl.testDynamicLoadingUrl-seed#[B56AB6A699945C60]) [ ] o.a.s.c.TestSolrConfigHandler going to send config command. path /config , payload: { [junit4] 2> 'create-requesthandler' : { 'name' : '/jarhandler', 'class': org.apache.solr.core.TestDynamicLoadingUrl$JarHandler, registerPath: '/solr,/v2' } [junit4] 2> 549688 INFO (Thread-2015) [ ] o.a.s.c.SolrCore config update listener called for core collection1_shard2_replica_n1 [junit4] 2> 549688 INFO (Thread-2014) [ ] o.a.s.c.SolrCore config update listener called for core collection1_shard1_replica_n2 [junit4] 2> 549689 INFO (Thread-2013) [ ] o.a.s.c.SolrCore config update listener called for core control_collection_shard1_replica_n1 [junit4] 2> 549689 INFO (Thread-2017) [ ] o.a.s.c.SolrCore config update listener called for core collection1_shard2_replica_n5 [junit4] 2> 549689 INFO (Thread-2016) [ ] o.a.s.c.SolrCore config update listener called for core collection1_shard1_replica_n6 [junit4] 2> 550916 INFO (Thread-2013) [n:127.0.0.1:56493_m c:control_collection s:shard1 r:core_node2 x:control_collection_shard1_replica_n1 ] o.a.s.c.SolrCore [control_collection_shard1_replica_n1] CLOSING SolrCore org.apache.solr.core.SolrCore@3bceb4b6 [junit4] 2> 550921 INFO (Thread-2015) [n:127.0.0.1:56521_m c:collection1 s:shard2 r:core_node3 x:collection1_shard2_replica_n1 ] o.a.s.c.SolrCore [collection1_shard2_replica_n1] CLOSING SolrCore org.apache.solr.core.SolrCore@724097d0 [junit4] 2> 550986 INFO (qtp1160063522-12178) [n:127.0.0.1:56525_m c:collection1 s:shard1 r:core_node4 x:collection1_shard1_replica_n2 ] o.a.s.c.SolrCore [collection1_shard1_replica_n2] CLOSING SolrCore org.apache.solr.core.SolrCore@6972fc14 [junit4] 2> 551100 INFO (TEST-TestDynamicLoadingUrl.testDynamicLoadingUrl-seed#[B56AB6A699945C60]) [ ] o.a.s.c.TestSolrConfigHandler going to send config command. path /config , payload: { [junit4] 2> 'add-runtimelib' : { 'name' : 'urljar', url : 'http://127.0.0.1:56525/m/collection1/jarhandler?wt=filestream' 'sha512':'e01b51de67ae1680a84a813983b1de3b592fc32f1a22b662fc9057da5953abd1b72476388ba342cad21671cd0b805503c78ab9075ff2f3951fdf75fa16981420'}} [junit4] 2> 55 INFO (TEST-TestDynamicLoadingUrl.testDynamicLoadingUrl-seed#[B56AB6A699945C60]) [ ] o.a.s.c.TestSolrConfigHandler going to send config command. 
path /config , payload: { [junit4] 2> 'add-runtimelib' : { 'name' : 'urljar', url : 'http://127.0.0.1:56531/m/collection1/jarhandler?wt=filestream' 'sha512':'d01b51de67ae1680a84a813983b1de3b592fc32f1a22b662fc9057da5953abd1b72476388ba342cad21671cd0b805503c78ab9075ff2f3951fdf75fa16981420'}} [junit4] 2> 551126 INFO (Thread-2016) [n:127.0.0.1:56536_m c:collection1 s:shard1 r:core_node8 x:collection1_shard1_replica_n6 ] o.a.s.c.SolrCore [collection1_shard1_replica_n6] CLOSING SolrCore org.apache.solr.core.SolrCore@3ff8c0a0 [junit4] 2> 551145 INFO (Thread-2017) [n:127.0.0.1:56531_m c:collection1 s:shard2 r:core_node7 x:collection1_shard2_replica_n5 ] o.a.s.c.SolrCore [collection1_shard2_replica_n5] CLOSING SolrCore org.apache.solr.core.SolrCore@11bde630 [junit4] 2> 551232 INFO (coreCloseExecutor-4027-thread-1) [n:127.0.0.1:56493_m c:control_collection s:shard1 r:core_node2 x:control_collection_shard1_replica_n1 ] o.a.s.c.SolrCore [control_collection_shard1_replica_n1] CLOSING SolrCore org.apache.solr.core.SolrCore@3c0a3773 [junit4] 2> 551234 INFO (coreCloseExecutor-4026-thread-1) [n:127.0.0.1:56521_m c:collection1 s:shard2 r:core_node3 x:collection1_shard2_replica_n1 ] o.a.s.c.SolrCore [collection1_shard2_replica_n1] CLOSING SolrCore org.apache.solr.core.SolrCore@735e18b7 [junit4] 2> 551235 INFO (coreCloseExecutor-4030-th
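One possible test-side mitigation for the race described above would be to poll until the reload has actually completed before issuing the follow-up config commands. A hedged sketch (the {{TimeOut}} utility is Solr's {{org.apache.solr.util.TimeOut}}; the {{isReloadComplete()}} check is a hypothetical stand-in for whatever signal the test can observe, e.g. hitting the newly registered {{/jarhandler}} and expecting a 200):

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.solr.common.util.TimeSource;
import org.apache.solr.util.TimeOut;

// Hypothetical sketch: before the test sends requests that depend on the
// newly registered /jarhandler, poll until the reloaded cores are serving it.
void waitForReload() throws InterruptedException {
  TimeOut timeout = new TimeOut(30, TimeUnit.SECONDS, TimeSource.NANO_TIME);
  while (!timeout.hasTimedOut()) {
    if (isReloadComplete()) { // hypothetical check, e.g. GET the new handler and expect a 200
      return;
    }
    Thread.sleep(100); // back off briefly between polls
  }
  throw new AssertionError("cores never finished reloading after config update");
}

boolean isReloadComplete() {
  // hypothetical: query /config (or the new handler) on every node
  return false;
}
{code}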
[jira] [Updated] (SOLR-13627) newly created collection can see all replicas go into recovery immediately on first document addition
[ https://issues.apache.org/jira/browse/SOLR-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13627: Attachment: apache_Lucene-Solr-NightlyTests-8.x_143.log.txt Status: Open (was: Open) I'm attaching the logs from {{apache_Lucene-Solr-NightlyTests-8.x_143.log.txt}} with the full test logs, but the gist of the situation is that these lines of code... {code} // NOTE: legacyCloud == false CollectionAdminRequest.setClusterProperty(ZkStateReader.LEGACY_CLOUD, legacyCloud).process(cluster.getSolrClient()); final String collectionName = "deleteFromClusterState_"+legacyCloud; CollectionAdminRequest.createCollection(collectionName, "conf", 1, 3) .process(cluster.getSolrClient()); cluster.waitForActiveCollection(collectionName, 1, 3); cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "1")); cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "2")); cluster.getSolrClient().commit(collectionName); {code} ...result in the following key bits of logging... {noformat} # set the cluster prop... [junit4] 2> 4266069 INFO (qtp1645118155-93196) [n:127.0.0.1:38694_solr ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :clusterprop with params val=false&name=legacyCloud&action=CLUSTERPROP&wt=javabin&version=2 and sendToOCPQueue=true # # Collection creation, w/leader election... # [junit4] 2> 4266070 INFO (qtp1645118155-93196) [n:127.0.0.1:38694_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={val=false&name=legacyCloud&action=CLUSTERPROP&wt=javabin&version=2} status=0 QTime=1 [junit4] 2> 4266071 INFO (qtp1645118155-93197) [n:127.0.0.1:38694_solr ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :create with params collection.configName=conf&name=deleteFromClusterState_false&nrtReplicas=3&action=CREATE&numShards=1&wt=javabin&version=2 and sendToOCPQueue=true ... [junit4] 2> 4266757 INFO (qtp1544834100-93213) [n:127.0.0.1:37510_solr x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node5&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n2&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT [junit4] 2> 4266814 INFO (qtp1240928045-93187) [n:127.0.0.1:44654_solr x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node6&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n3&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT [junit4] 2> 4266832 INFO (qtp1645118155-93199) [n:127.0.0.1:38694_solr x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node4&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n1&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT ... [junit4] 2> 4268946 INFO (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node5=0}, version=0} ... 
[junit4] 2> 4269040 INFO (qtp1240928045-93187) [n:127.0.0.1:44654_solr c:deleteFromClusterState_false s:shard1 r:core_node6 x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying [junit4] 2> 4269040 INFO (qtp1645118155-93199) [n:127.0.0.1:38694_solr c:deleteFromClusterState_false s:shard1 r:core_node4 x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node4=0, core_node5=0}, version=1} [junit4] 2> 4269040 INFO (qtp1645118155-93199) [n:127.0.0.1:38694_solr c:deleteFromClusterState_false s:shard1 r:core_node4 x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.c.ShardLeaderElectionContextBase make sure parent is created /collections/deleteFromClusterState_false/leaders/shard1 ... [junit4] 2> 4269106 INFO (qtp1240928045-93187) [n:127.0.0.1:44654_solr c:deleteFromClusterState_false s:shard1 r:core_node6 x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node6=0, core_node4=0, core_node5=0}, version=2} ... [junit4] 2> 4269487 INFO (qtp154483
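The "Failed to save terms, version is not a match, retrying" line above is ZkShardTerms' optimistic-concurrency loop at work: each replica reads the terms node plus its ZooKeeper version, then writes conditionally on that version. A generic sketch of that pattern against the raw ZooKeeper API (paths and the data handling are simplified; this is not the actual ZkShardTerms code):

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Generic compare-and-set loop: read the node and its version, apply the
// change locally, then write conditionally; BadVersionException means some
// other replica won the race, so re-read and try again.
void updateTerms(ZooKeeper zk, String path) throws Exception {
  while (true) {
    Stat stat = new Stat();
    byte[] data = zk.getData(path, false, stat);
    byte[] updated = applyLocalChange(data); // hypothetical helper merging this replica's term
    try {
      zk.setData(path, updated, stat.getVersion()); // conditional on the version we read
      return;
    } catch (KeeperException.BadVersionException e) {
      // the version moved underneath us (like the log line above); retry
    }
  }
}

byte[] applyLocalChange(byte[] data) {
  return data; // hypothetical placeholder
}
{code}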
[jira] [Created] (SOLR-13627) newly created collection can see all replicas go into recovery immediately on first document addition
Hoss Man created SOLR-13627: --- Summary: newly created collection can see all replicas go into recovery immediately on first document addition Key: SOLR-13627 URL: https://issues.apache.org/jira/browse/SOLR-13627 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man There's something very weird going on that popped up in a recent jenkins run of {{DeleteReplicaTest.deleteReplicaFromClusterState}}. While the test has some issues of its own, and ultimately failed due to a combination of SOLR-13616 + some sloppy assertions (which I will attempt to address independently of this Jira), a more alarming situation is what the logs show at the _beginning_ of the tests, before any problems occurred. In a nutshell: *Just the act of creating a 1x3 collection and adding some docs to it caused both of the non-leader replicas to immediately decide they needed to go into recovery.* Details to follow in comments... -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13532) Unable to start core recovery due to timeout in ping request
[ https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13532: Resolution: Fixed Assignee: Hoss Man Fix Version/s: 8.3 8.2 master (9.0) Status: Resolved (was: Patch Available) Thanks Suril! > Unable to start core recovery due to timeout in ping request > > > Key: SOLR-13532 > URL: https://issues.apache.org/jira/browse/SOLR-13532 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 7.6 >Reporter: Suril Shah >Assignee: Hoss Man >Priority: Major > Fix For: master (9.0), 8.2, 8.3 > > Attachments: SOLR-13532.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Discovered following issue with the core recovery: > * Core recovery is not being initialized and throwing following exception > message : > {code:java} > 2019-06-07 00:53:12.436 INFO > (recoveryExecutor-4-thread-1-processing-n::8983_solr > x:_shard41_replica_n2777 c: s:shard41 > r:core_node2778) x:_shard41_replica_n2777 > o.a.s.c.RecoveryStrategy Failed to connect leader http://:8983/solr > on recovery, try again{code} > * Above error occurs when ping request takes time more than a timeout period > which is hard-coded to one second in solr source code. However In a general > production setting it is common to have ping time more than one second, > hence, the core recovery never starts and exception is thrown. > * Also the other major concern is that this exception is logged as an info > message, hence it is very difficult to identify the error if info logging is > not enabled. > * Please refer to following code snippet from the [source > code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803] > to understand the above issue. > {code:java} > try (HttpSolrClient httpSolrClient = new > HttpSolrClient.Builder(leaderReplica.getCoreUrl()) > .withSocketTimeout(1000) > .withConnectionTimeout(1000) > > .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient()) > .build()) { > SolrPingResponse resp = httpSolrClient.ping(); > return leaderReplica; > } catch (IOException e) { > log.info("Failed to connect leader {} on recovery, try again", > leaderReplica.getBaseUrl()); > Thread.sleep(500); > } catch (Exception e) { > if (e.getCause() instanceof IOException) { > log.info("Failed to connect leader {} on recovery, try again", > leaderReplica.getBaseUrl()); > Thread.sleep(500); > } else { > return leaderReplica; > } > } > {code} > The above issue will have high impact in production level clusters, since > cores not being able to recover may lead to data loss. > Following improvements would be really helpful: > 1. The [timeout for ping > request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791] > in *RecoveryStrategy.java* should be configurable and the defaults set to > high values like 15seconds. > 2. The exception message in [line > 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797] > and [line > 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801] > in *RecoveryStrategy.java* should be logged as *error* messages instead of > *info* messages -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)
[ https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883157#comment-16883157 ] Hoss Man commented on SOLR-13616: - not sure why/how gitbox missed the 8x cherry-pick: https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=81b2e06ffe6bddcd8d25b24c79683281da85baee > Possible racecondition/deadlock between collection DELETE and PrepRecovery ? > (TestPolicyCloud failures) > --- > > Key: SOLR-13616 > URL: https://issues.apache.org/jira/browse/SOLR-13616 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > Attachments: SOLR-13616.test-incomplete.patch, > thetaphi_Lucene-Solr-master-Linux_24358.log.txt > > > Based on some recent jenkins failures in TestPolicyCloud, I suspect there is > a possible deadlock condition when attempting to delete a collection while > recovery is in progress. > I haven't been able to identify exactly where/why/how the problem occurs, but > it does not appear to be a test specific problem, and seems like it could > potentially affect anyone unlucky enough to issue poorly timed DELETE. > Details to follow in comments... -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)
[ https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882245#comment-16882245 ] Hoss Man commented on SOLR-13616: - {quote}I'm not sure we should change the waitForState logic to rethrow Exceptions or revert back PrepRecoveryOp to its previous version ... {quote} {quote}Hoss and Dat – thank you for investigating this! All usages of CollectionStateWatcher or LiveNodesWatcher will suffer from this problem i.e. the thread that runs the watcher swallows the exception ... {quote} Well, generally speaking there isn't any way (I can think of) for the thread executing a Watcher to do anything _but_ swallow any exceptions from the watcher – it can't propagate them back to the "caller" of registerWatcher or anything like that ... if the caller wanted to be informed then the Watcher it registered should be catching the exceptions itself. But to Dat's point: in the specific case of {{waitForState}} – the ZkStateReader *is* creating its own Watcher to wrap the input Predicate, and we could in fact make waitForState do something inside that Watcher that catches any Exception thrown by the Predicate and short-circuits out of the {{waitForState}} call, wrapping/re-throwing the exception in the meantime. But those seem like "broader" problems with regards to where/how the different callers are using the Watcher/waitForState APIs that we should probably create a new issue to track (for auditing all of them and clarifying the behavior in the javadocs) ... frankly I think in this specific jira we should be asking a lot more questions about the _specific_ predicate used in PrepRecovery's waitForState call ... notably what exactly is the expectation here when the SolrCore (that prepRecovery wants to recover from) can't be found _in the local CoreContainer_ ... deleting the collection is just one example, but are there other situations where the core may not be found at this point in the code? (node shutdown perhaps? autoscaling removing a replica?) What about a few lines later... {code:java} if (onlyIfLeader != null && onlyIfLeader) { if (!core.getCoreDescriptor().getCloudDescriptor().isLeader()) { throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "We are not the leader"); } } {code} ...even if the SolrCore is found, if we expect it to be the shard leader, and it's not (what if there has been a leader election in the meantime?) then that's another type of problem that will also cause the predicate to throw an exception that will (apparently) cause PrepRecovery to stall. What should PrepRecovery do here? I suspect that in general the use of waitForState here in PrepRecoveryOp is "ok in concept" ... we just need to make the predicate smarter about exiting immediately in these situations instead of throwing an exception that gets swallowed ... I'm just not sure what the right behavior for PrepRecovery *is* in these situations. I don't suppose either of you were able to spot what's "wrong" with my test, such that it doesn't force a failure in this situation? > Possible racecondition/deadlock between collection DELETE and PrepRecovery ? > (TestPolicyCloud failures) > --- > > Key: SOLR-13616 > URL: https://issues.apache.org/jira/browse/SOLR-13616 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. 
Issues are Public) >Reporter: Hoss Man >Priority: Major > Attachments: SOLR-13616.test-incomplete.patch, > thetaphi_Lucene-Solr-master-Linux_24358.log.txt > > > Based on some recent jenkins failures in TestPolicyCloud, I suspect there is > a possible deadlock condition when attempting to delete a collection while > recovery is in progress. > I haven't been able to identify exactly where/why/how the problem occurs, but > it does not appear to be a test specific problem, and seems like it could > potentially affect anyone unlucky enough to issue poorly timed DELETE. > Details to follow in comments... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
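To sketch the waitForState idea from the comment above (rough, against the 8.x {{CollectionStatePredicate}} shape; the wrapping would ultimately belong inside ZkStateReader itself, so treat this as a caller-side illustration, not the proposed patch): catch anything the caller's predicate throws, satisfy the wait so the watcher is released, and then rethrow to the caller.

{code:java}
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.solr.common.cloud.CollectionStatePredicate;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.ZkStateReader;

// Hypothetical workaround: don't let a throwing predicate get silently
// swallowed by the watcher thread -- short-circuit the wait and rethrow.
static void waitForStateRethrowing(ZkStateReader reader, String collection,
    long waitSeconds, CollectionStatePredicate userPredicate) throws Exception {
  final AtomicReference<RuntimeException> failure = new AtomicReference<>();
  CollectionStatePredicate wrapped = (Set<String> liveNodes, DocCollection state) -> {
    try {
      return userPredicate.matches(liveNodes, state);
    } catch (RuntimeException e) {
      failure.set(e);   // remember the real problem...
      return true;      // ...and satisfy the wait so we stop blocking
    }
  };
  reader.waitForState(collection, waitSeconds, TimeUnit.SECONDS, wrapped);
  if (failure.get() != null) {
    throw failure.get(); // surface the predicate's exception to the caller
  }
}
{code}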
[jira] [Commented] (SOLR-12368) in-place DV updates should no longer have to jump through hoops if field does not yet exist
[ https://issues.apache.org/jira/browse/SOLR-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882226#comment-16882226 ] Hoss Man commented on SOLR-12368: - Hey [~munendrasn] - patch functionality looks good to me, but I'm confused by your last comment... bq. As part of this issue, I will commit only solr changes and raise lucene issue for deprecating and removing IndexWriter#getFieldNames ...but in your latest patch, IndexWriter.getFieldNames (and the underlying FieldInfos method) are still being removed ... shouldn't those be moved to a new (linked) issue/patch so that the commit for _this_ issue can be trivially backported? > in-place DV updates should no longer have to jump through hoops if field does > not yet exist > --- > > Key: SOLR-12368 > URL: https://issues.apache.org/jira/browse/SOLR-12368 > Project: Solr > Issue Type: Improvement >Reporter: Hoss Man >Priority: Major > Attachments: SOLR-12368.patch, SOLR-12368.patch, SOLR-12368.patch > > > When SOLR-5944 first added "in-place" DocValue updates to Solr, one of the > edge cases that had to be dealt with was the limitation imposed by > IndexWriter that docValues could only be updated if they already existed - if > a shard did not yet have a document w/a value in the field where the update > was attempted, we would get an error. > LUCENE-8316 seems to have removed this error, which I believe means we can > simplify & speed up some of the checks in Solr, and support this situation as > well, rather than falling back on full "read stored fields & reindex" atomic > update -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
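For context, the Lucene API at the heart of this issue (which, per LUCENE-8316, should no longer error when the field has no prior value in a segment) is the in-place docValues update on IndexWriter. A minimal illustration (field and term values are made up):

{code:java}
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Update the numeric docValues field "popularity" for the doc whose unique
// key matches, without re-reading and reindexing the document's stored fields.
void bumpPopularity(IndexWriter writer) throws Exception {
  writer.updateNumericDocValue(new Term("id", "1"), "popularity", 42L);
}
{code}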
[jira] [Updated] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)
[ https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13616: Attachment: SOLR-13616.test-incomplete.patch thetaphi_Lucene-Solr-master-Linux_24358.log.txt Status: Open (was: Open) Unlike most tests that explicitly waitFor/assert *active* replicas, TestPolicyCloud (currently) has several tests that only assert the quantity and location of a replica – it doesn't wait for them to become active, so when testing an ADDREPLICA or a SPLITSHARD, those new replicas are still in recovery (or PrepRecovery) when the test tries to do cleanup and delete the collection – which frequently fails with timeout problems. While we can certainly "improve" TestPolicyCloud to wait for recoveries to finish, and all replicas to be active before attempting to delete the collection, a better question is: why is this needed? I'm attaching {{thetaphi_Lucene-Solr-master-Linux_24358.log.txt}} which demonstrates the problem in {{TestPolicyCloud.testCreateCollectionAddReplica}}; here are some highlights... {noformat} # thetaphi_Lucene-Solr-master-Linux_24358.log.txt # # testCreateCollectionAddReplica # bulk of test logic is finished, test has added a replica and confirmed it's on the expected node # # but meanwhile, recovery is still ongoing... [junit4] 2> 959699 INFO (recoveryExecutor-5888-thread-1-processing-n:127.0.0.1:42097_solr x:testCreateCollectionAddReplica_shard1_replica_n3 c:testCreateCollectionAddReplica s:shard1 r:core_node4) [n:127.0.0.1:42097_solr c:testCreateCollectionAddReplica s:shard1 r:core_node4 x:testCreateCollectionAddReplica_shard1_replica_n3 ] o.a.s.c.RecoveryStrategy Sending prep recovery command to [https://127.0.0.1:42097/solr]; [WaitForState: action=PREPRECOVERY&core=testCreateCollectionAddReplica_shard1_replica_n1&nodeName=127.0.0.1:42097_solr&coreNodeName=core_node4&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true] [junit4] 2> 959701 INFO (qtp531873617-17025) [n:127.0.0.1:42097_solr x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node4, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true [junit4] 2> 959701 INFO (qtp531873617-17025) [n:127.0.0.1:42097_solr x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=testCreateCollectionAddReplica, shard=shard1, thisCore=testCreateCollectionAddReplica_shard1_replica_n1, leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, currentState=down, localState=active, nodeName=127.0.0.1:42097_solr, coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: core_node4:{ [junit4] 2> "core":"testCreateCollectionAddReplica_shard1_replica_n3", [junit4] 2> "base_url":"https://127.0.0.1:42097/solr";, [junit4] 2> "state":"down", [junit4] 2> "node_name":"127.0.0.1:42097_solr", [junit4] 2> "type":"NRT"} ... # the test thread moves on to @After method which calls MiniSolrCloudCluster.deleteAllCollections() ... ... [junit4] 2> 959703 INFO (qtp531873617-17021) [n:127.0.0.1:42097_solr ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :delete with params name=testCreateCollectionAddReplica&action=DELETE&wt=javabin&version=2 and sendToOCPQueue=true ... 
[junit4] 2> 959709 INFO (OverseerThreadFactory-5345-thread-5-processing-n:127.0.0.1:44991_solr) [n:127.0.0.1:44991_solr ] o.a.s.c.a.c.OverseerCollectionMessageHandler Executing Collection Cmd=action=UNLOAD&deleteInstanceDir=true&deleteDataDir=true&deleteMetricsHistory=true, asyncId=null ... [junit4] 2> 959750 INFO (qtp531873617-17325) [n:127.0.0.1:42097_solr x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.c.SolrCore [testCreateCollectionAddReplica_shard1_replica_n1] CLOSING SolrCore org.apache.solr.core.SolrCore@39444e66 ... [junit4] 2> 959753 INFO (qtp531873617-17325) [n:127.0.0.1:42097_solr x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={deleteInstanceDir=true&deleteMetricsHistory=true&core=testCreateCollectionAddReplica_shard1_replica_n1&qt=/admin/cores&deleteDataDir=true&action=UNLOAD&wt=javabin&version=2} status=0 QTime=17 # but meanwhile, PrepRecoveryOp is currently blocked on a call to ZkStateReader.waitForState # looking for specific conditions for the leader and (new) replica (that needs to recover) # ... BUT!... the leader core has already been closed, so the watcher never succeeds, # ...so PrepRecovery keeps waitForState ... [junit4] 2> 959855 WARN (watches-5915-thread-1) [ ] o.a.s.c.c.ZkStateReader Error on calling watcher [junit4] 2> => org.apache.solr.common.SolrException: core not found:testC
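A hedged sketch of the test-side "improve TestPolicyCloud" option mentioned above, using the same MiniSolrCloudCluster helpers that appear elsewhere in these logs (the 1 shard x 2 replicas counts are illustrative; each test would pass its own): wait for the replicas to go ACTIVE before cleanup deletes the collection, so the DELETE cannot race PrepRecovery's waitForState.

{code:java}
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.cloud.MiniSolrCloudCluster;

// After the ADDREPLICA / SPLITSHARD assertions, don't leave replicas
// mid-recovery: wait for them to become ACTIVE, then delete.
void safeCleanup(MiniSolrCloudCluster cluster, String collection) throws Exception {
  cluster.waitForActiveCollection(collection, 1, 2); // shards, total replicas (per test)
  CollectionAdminRequest.deleteCollection(collection).process(cluster.getSolrClient());
}
{code}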
[jira] [Created] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)
Hoss Man created SOLR-13616: --- Summary: Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures) Key: SOLR-13616 URL: https://issues.apache.org/jira/browse/SOLR-13616 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man Based on some recent jenkins failures in TestPolicyCloud, I suspect there is a possible deadlock condition when attempting to delete a collection while recovery is in progress. I haven't been able to identify exactly where/why/how the problem occurs, but it does not appear to be a test specific problem, and seems like it could potentially affect anyone unlucky enough to issue poorly timed DELETE. Details to follow in comments... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades
[ https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880481#comment-16880481 ] Hoss Man commented on SOLR-13599: - This is the epitome of a heisenbug ... 5 days ago I committed a change to master that adds a bit of extra logging to the test, and since then there hasn't been a single master failure -- but in the same amount of time, 7/10 of the 8x builds have failed, and all but one of those reproduced 3x (or more) times. Not sure what to do here except backport the logging changes to 8x, and hope we get another failure eventually so we'll have something to diagnose. > ReplicationFactorTest high failure rate on Windows jenkins VMs after > 2019-06-22 OS/java upgrades > > > Key: SOLR-13599 > URL: https://issues.apache.org/jira/browse/SOLR-13599 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Priority: Major > Attachments: thetaphi_Lucene-Solr-master-Windows_8025.log.txt > > > We've started seeing some weirdly consistent (but not reliably reproducible) > failures from ReplicationFactorTest when running on Uwe's Windows jenkins > machines. > The failures all seem to have started on June 22 -- when Uwe upgraded his > Windows VMs to upgrade the Java version, but happen across all versions of > java tested, and on both the master and branch_8x. > While this test failed a total of 5 times, in different ways, on various > jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on > all but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and > when it fails the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins > builds frequently fails anywhere from 1-4 additional times. > All of these failures occur in the exact same place, with the exact same > assertion: that the expected replicationFactor of 2 was not achieved, and an > rf=1 (ie: only the master) was returned, when sending a _batch_ of documents > to a collection with 1 shard, 3 replicas; while 1 of the replicas was > partitioned off due to a closed proxy. > In the handful of logs I've examined closely, the 2nd "live" replica does in > fact log that it received & processed the update, but with a QTime of over 30 > seconds, and then it immediately logs an > {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- > meanwhile, the leader has one {{updateExecutor}} thread logging copious > amounts of {{java.net.ConnectException: Connection refused: no further > information}} regarding the replica that was partitioned off, before a second > {{updateExecutor}} thread ultimately logs > {{java.util.concurrent.ExecutionException: > java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" > replica. > > What makes this perplexing is that this is not the first time in the test > that documents were added to this collection while one replica was > partitioned off, but it is the first time that all 3 of the following are > true _at the same time_: > # the collection has recovered after some replicas were partitioned and > re-connected > # a batch of multiple documents is being added > # one replica has been "re" partitioned. > ...prior to the point when this failure happens, only individual document > adds were tested while replicas were partitioned. Batches of adds were only > tested when all 3 replicas were "live" after the proxies were re-opened and > the collection had fully recovered. 
The failure also comes from the first > update to happen after a replica's proxy port has been "closed" for the > _second_ time. > While this confluence of events might conceivably trigger some weird bug, > what makes these failures _particularly_ perplexing is that: > * the failures only happen on Windows > * the failures only started after the Windows VM update on June-22. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13532) Unable to start core recovery due to timeout in ping request
[ https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13532: Status: Patch Available (was: Open) > Unable to start core recovery due to timeout in ping request > > > Key: SOLR-13532 > URL: https://issues.apache.org/jira/browse/SOLR-13532 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 7.6 >Reporter: Suril Shah >Priority: Major > Attachments: SOLR-13532.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Discovered following issue with the core recovery: > * Core recovery is not being initialized and throwing following exception > message : > {code:java} > 2019-06-07 00:53:12.436 INFO > (recoveryExecutor-4-thread-1-processing-n::8983_solr > x:_shard41_replica_n2777 c: s:shard41 > r:core_node2778) x:_shard41_replica_n2777 > o.a.s.c.RecoveryStrategy Failed to connect leader http://:8983/solr > on recovery, try again{code} > * Above error occurs when ping request takes time more than a timeout period > which is hard-coded to one second in solr source code. However In a general > production setting it is common to have ping time more than one second, > hence, the core recovery never starts and exception is thrown. > * Also the other major concern is that this exception is logged as an info > message, hence it is very difficult to identify the error if info logging is > not enabled. > * Please refer to following code snippet from the [source > code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803] > to understand the above issue. > {code:java} > try (HttpSolrClient httpSolrClient = new > HttpSolrClient.Builder(leaderReplica.getCoreUrl()) > .withSocketTimeout(1000) > .withConnectionTimeout(1000) > > .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient()) > .build()) { > SolrPingResponse resp = httpSolrClient.ping(); > return leaderReplica; > } catch (IOException e) { > log.info("Failed to connect leader {} on recovery, try again", > leaderReplica.getBaseUrl()); > Thread.sleep(500); > } catch (Exception e) { > if (e.getCause() instanceof IOException) { > log.info("Failed to connect leader {} on recovery, try again", > leaderReplica.getBaseUrl()); > Thread.sleep(500); > } else { > return leaderReplica; > } > } > {code} > The above issue will have high impact in production level clusters, since > cores not being able to recover may lead to data loss. > Following improvements would be really helpful: > 1. The [timeout for ping > request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791] > in *RecoveryStrategy.java* should be configurable and the defaults set to > high values like 15seconds. > 2. The exception message in [line > 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797] > and [line > 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801] > in *RecoveryStrategy.java* should be logged as *error* messages instead of > *info* messages -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-13532) Unable to start core recovery due to timeout in ping request
[ https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13532: Attachment: SOLR-13532.patch Status: Open (was: Open) bq. The other alternative to this would be to update the {{RecoveryStrategy}} code to use something like {{cc.getConfig().getUpdateShardHandlerConfig()}} ... Here's a variant of Suril's patch along those lines, with some refactoring to put the logic into a helper method. I don't love it -- but I don't hate it either. I'm still running tests to make sure I didn't break anything, but in the meantime what do folks think? ... can anyone see any problems with this approach? ([~surilshah]: does this patch -- and the usage of the solr.xml configured values instead of hardcoded magic constants -- solve the problems you're seeing?) > Unable to start core recovery due to timeout in ping request > > > Key: SOLR-13532 > URL: https://issues.apache.org/jira/browse/SOLR-13532 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 7.6 >Reporter: Suril Shah >Priority: Major > Attachments: SOLR-13532.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Discovered following issue with the core recovery: > * Core recovery is not being initialized and throwing following exception > message : > {code:java} > 2019-06-07 00:53:12.436 INFO > (recoveryExecutor-4-thread-1-processing-n::8983_solr > x:_shard41_replica_n2777 c: s:shard41 > r:core_node2778) x:_shard41_replica_n2777 > o.a.s.c.RecoveryStrategy Failed to connect leader http://:8983/solr > on recovery, try again{code} > * Above error occurs when ping request takes time more than a timeout period > which is hard-coded to one second in solr source code. However In a general > production setting it is common to have ping time more than one second, > hence, the core recovery never starts and exception is thrown. > * Also the other major concern is that this exception is logged as an info > message, hence it is very difficult to identify the error if info logging is > not enabled. > * Please refer to following code snippet from the [source > code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803] > to understand the above issue. > {code:java} > try (HttpSolrClient httpSolrClient = new > HttpSolrClient.Builder(leaderReplica.getCoreUrl()) > .withSocketTimeout(1000) > .withConnectionTimeout(1000) > > .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient()) > .build()) { > SolrPingResponse resp = httpSolrClient.ping(); > return leaderReplica; > } catch (IOException e) { > log.info("Failed to connect leader {} on recovery, try again", > leaderReplica.getBaseUrl()); > Thread.sleep(500); > } catch (Exception e) { > if (e.getCause() instanceof IOException) { > log.info("Failed to connect leader {} on recovery, try again", > leaderReplica.getBaseUrl()); > Thread.sleep(500); > } else { > return leaderReplica; > } > } > {code} > The above issue will have high impact in production level clusters, since > cores not being able to recover may lead to data loss. > Following improvements would be really helpful: > 1. The [timeout for ping > request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791] > in *RecoveryStrategy.java* should be configurable and the defaults set to > high values like 15seconds. > 2. 
The exception message in [line > 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797] > and [line > 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801] > in *RecoveryStrategy.java* should be logged as *error* messages instead of > *info* messages -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13457) Managing Timeout values in Solr
[ https://issues.apache.org/jira/browse/SOLR-13457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878183#comment-16878183 ] Hoss Man commented on SOLR-13457: - SOLR-13605 shows some more of the madness involved in how these settings are borked -- even if you just focus on the SolrJ APIs for specifying things (notably {{HttpSolrClient.Builder.withHttpClient}}) w/o even considering how *solr* should use those SolrJ APIs based on things like {{solr.xml}} > Managing Timeout values in Solr > --- > > Key: SOLR-13457 > URL: https://issues.apache.org/jira/browse/SOLR-13457 > Project: Solr > Issue Type: Improvement >Affects Versions: master (9.0) >Reporter: Gus Heck >Priority: Major > > Presently, Solr has a variety of timeouts for various connections or > operations. These timeouts have been added, tweaked and refined and in some > cases made configurable in an ad-hoc manner by the contributors of individual > features throughout the history of the project. This is all well and good > until one experiences a timeout during an otherwise valid use case and needs > to adjust it. > This has also made managing timeouts in unit tests "interesting" as noted in > SOLR-13389. > Probably nobody has the spare time to do a tour de force through the code and > coordinate every single timeout, so in this ticket I'd like to establish a > framework for categorizing timeouts, a standard for how we make each > category configurable, and then add sub-tickets to address individual > timeouts. > The intention is that eventually, there will be no "magic number" timeout > values in code, and one can predict where to find the configuration for a > timeout by determining its category. > Initial strawman categories (feel free to knock down or suggest alternatives): > # *Feature-Instance Timeout*: Timeouts that relate to a particular > instantiation of a feature, for example a database connection timeout for a > connection to a particular database by DIH. These should be set in the > configuration of that instance. > # *Optional Feature Timeout*: A timeout that only has meaning in the context > of a particular feature that is not required for solr to function... i.e. > something that can be turned on or off. Perhaps a timeout for communication > with an external ldap for authentication purposes. These should be configured > in the same configuration that enables this feature. > # *Global System Timeout*: A timeout that will always be an active part > of Solr; these should be configured in a new section of solr.xml. For > example, the Jetty thread idle timeout, or the default timeout for http calls > between nodes. > # *Node Specific Timeout*: A timeout which may differ on different nodes. I > don't know of any of these, but I'll grant the possibility. These (and only > these) should be set by setting system properties. If we don't have any of > these, that's just fine :). > # *Client Timeout*: These are timeouts in solrj code that are active in code > running outside the server. They should be configurable via java api, and via > a config file of some sort from a single location defined in a sysprop or > sourced from classpath (in that order). When run on the server, the solrj > code should look for a *Global System Timeout* setting before consulting > sysprops or classpath. 
> *Note that in no case is a hard-coded value the correct solution.* > If we get a consensus on categories and their locations, then the next step > is to begin adding sub tickets to bring specific timeouts into compliance. > Every such ticket should include an update to the section of the ref guide > documenting the configuration to which the timeout has been added (e.g. docs > for solr.xml for Global System Timeouts) describing what exactly is affected > by the timeout, the maximum allowed value and how zero and negative numbers > are handled. > It is of course true that some of these values will have the potential to > destroy system performance or integrity, and that should be mentioned in the > update to documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-13532) Unable to start core recovery due to timeout in ping request
[ https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878180#comment-16878180 ] Hoss Man commented on SOLR-13532: - My first impression on seeing this patch was that I _really_ dislike the idea of "fixing" a hardcoded timeout by changing it to a _different_ hardcoded timeout – I would really much rather we use the existing {{solr.xml}} configured timeouts for this sort of thing. So then I went poking around the code to refresh my memory about how/where the SO & CONNECT timeouts config options for intranode requests get populated in the code to propose an alternative patch that uses them, and realized that we already have an {{UpdateShardHandler.getRecoveryOnlyHttpClient()}} method that returns an HttpClient pre-configured with the correct timeout values ... and then I realized that this is already used in the code in question via {{withHttpClient(...)}}... {code:java} // existing, pre-patch, code in RecoveryStrategy try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl()) .withSocketTimeout(1000) .withConnectionTimeout(1000) .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient()) {code} This {{UpdateShardHandler.getRecoveryOnlyHttpClient()}} concept, and that corresponding {{withHttpClient()}} call, was introduced *after* the original recovery code was written (with those hardcoded timeouts) ... In theory if we just remove the {{withSocketTimeout}} and {{withConnectionTimeout}} calls completely from this class, then the cluster's {{solr.xml}} configuration options should start getting used. But then I dug deeper and discovered that the way HttpSolrClient & its Builder work is really silly and frustrating and causes the hardcoded values {{SolrClientBuilder.connectionTimeoutMillis = 15000}} and {{SolrClientBuilder.socketTimeoutMillis = 120000}} to get used at the request level, even when {{withHttpClient}} has been called to set an {{HttpClient}} that already has the settings we want ... basically defeating a huge part of the value in {{withHttpClient}} ... even using values of {{null}} or {{-1}} won't work, because of other nonsensical ways that "default" values come into play. I created SOLR-13605 to track the silliness in {{HttpSolrClient.Builder}} – it's a bigger issue than just fixing this ping/recovery problem, and will require more careful consideration. As much as it pains me to say this: I think that for now, for the purpose of fixing the bug in this jira, we should just remove the {{withSocketTimeout()}} and {{withConnectionTimeout()}} calls completely, and defer to the (pre-existing) hardcoded defaults in {{SolrClientBuilder}} ... at least that way we're reducing the number of hardcoded defaults in the code, and if/when SOLR-13605 gets fixed, the {{solr.xml}} settings should take effect. The other alternative to this would be to update the {{RecoveryStrategy}} code to use something like {{cc.getConfig().getUpdateShardHandlerConfig()}} and then use {{UpdateShardHandlerConfig.getDistributedSocketTimeout()}} and {{UpdateShardHandlerConfig.getDistributedConnectionTimeout()}} to pass as the inputs to {{HttpSolrClient.Builder}} ... that seemed really silly and redundant when it first occurred to me, but the more I think about it the more it's probably not that bad as a workaround for SOLR-13605 until it's fixed. What do folks think? 
> Unable to start core recovery due to timeout in ping request > > > Key: SOLR-13532 > URL: https://issues.apache.org/jira/browse/SOLR-13532 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 7.6 >Reporter: Suril Shah >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Discovered following issue with the core recovery: > * Core recovery is not being initialized and throwing following exception > message : > {code:java} > 2019-06-07 00:53:12.436 INFO > (recoveryExecutor-4-thread-1-processing-n::8983_solr > x:_shard41_replica_n2777 c: s:shard41 > r:core_node2778) x:_shard41_replica_n2777 > o.a.s.c.RecoveryStrategy Failed to connect leader http://:8983/solr > on recovery, try again{code} > * Above error occurs when ping request takes time more than a timeout period > which is hard-coded to one second in solr source code. However In a general > production setting it is common to have ping time more than one second, > hence, the core recovery never starts and exception is thrown. > * Also the other major concern is that this exception is logged as an info > message, hence it is very difficult to identify the error if info logging is > not enabled. > * Please refer to following code snippet from the [source > cod
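For concreteness, the "other alternative" described in the comment above might look roughly like this inside RecoveryStrategy (a sketch of the idea, not the committed patch; it assumes {{cc}} is the CoreContainer already available there, and it keeps the request-level {{withSocketTimeout}}/{{withConnectionTimeout}} calls precisely because, per the SOLR-13605 discussion, the builder's defaults would otherwise override the HttpClient's settings):

{code:java}
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.SolrPingResponse;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.update.UpdateShardHandlerConfig;

// Derive the ping client's timeouts from the solr.xml updateShardHandler
// config instead of the hardcoded 1000ms values.
Replica pingLeader(CoreContainer cc, Replica leaderReplica) throws Exception {
  UpdateShardHandlerConfig cfg = cc.getConfig().getUpdateShardHandlerConfig();
  try (HttpSolrClient client = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
      .withSocketTimeout(cfg.getDistributedSocketTimeout())
      .withConnectionTimeout(cfg.getDistributedConnectionTimeout())
      .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
      .build()) {
    SolrPingResponse resp = client.ping(); // success means the leader is reachable
    return leaderReplica;
  }
}
{code}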
[jira] [Created] (SOLR-13605) HttpSolrClient.Builder.withHttpClient() is useless for the purpose of setting client scoped so/connect timeouts
Hoss Man created SOLR-13605: --- Summary: HttpSolrClient.Builder.withHttpClient() is useless for the purpose of setting client scoped so/connect timeouts Key: SOLR-13605 URL: https://issues.apache.org/jira/browse/SOLR-13605 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man TL;DR: trying to use {{HttpSolrClient.Builder.withHttpClient}} is useless for the purpose of specifying an {{HttpClient}} with the default "timeouts" you want to use on all requests, because of how {{HttpSolrClient.Builder}} and {{HttpClientUtil.createDefaultRequestConfigBuilder()}} hardcode values that get set on every {{HttpRequest}}. This internally affects code that uses things like {{UpdateShardHandler.getDefaultHttpClient()}}, {{UpdateShardHandler.getUpdateOnlyHttpClient()}}, {{UpdateShardHandler.getRecoveryOnlyHttpClient()}}, etc... While looking into the patch in SOLR-13532, I realized that because of the way {{HttpSolrClient.Builder}} and its superclass {{SolrClientBuilder}} work, the following code doesn't do what a reasonable person would expect... {code:java} SolrParams clientParams = params(HttpClientUtil.PROP_SO_TIMEOUT, 12345, HttpClientUtil.PROP_CONNECTION_TIMEOUT, 67890); HttpClient httpClient = HttpClientUtil.createClient(clientParams); HttpSolrClient solrClient = new HttpSolrClient.Builder(ANY_BASE_SOLR_URL) .withHttpClient(httpClient) .build(); {code} When {{solrClient}} is used to execute a request, neither of the properties passed to {{HttpClientUtil.createClient(...)}} will matter - the {{HttpSolrClient.Builder}} (via inheritance from {{SolrClientBuilder}}) has the following hardcoded values... {code:java} // SolrClientBuilder protected Integer connectionTimeoutMillis = 15000; protected Integer socketTimeoutMillis = 120000; {code} ...which, unless overridden by calls to {{withConnectionTimeout()}} and {{withSocketTimeout()}}, will get set on the {{HttpSolrClient}} object, and used on every request... {code:java} // protected HttpSolrClient constructor this.connectionTimeout = builder.connectionTimeoutMillis; this.soTimeout = builder.socketTimeoutMillis; {code} It would be tempting to try and do something like this to work around the problem... {code:java} SolrParams clientParams = params(HttpClientUtil.PROP_SO_TIMEOUT, 12345, HttpClientUtil.PROP_CONNECTION_TIMEOUT, 67890); HttpClient httpClient = HttpClientUtil.createClient(clientParams); HttpSolrClient solrClient = new HttpSolrClient.Builder(ANY_BASE_SOLR_URL) .withHttpClient(httpClient) .withSocketTimeout(null) .withConnectionTimeout(null) .build(); {code} ...except for 2 problems: # In {{HttpSolrClient.executeMethod}}, if the values of {{this.connectionTimeout}} or {{this.soTimeout}} are null, then the values from {{HttpClientUtil.createDefaultRequestConfigBuilder();}} get used, which has its own hardcoded defaults. # {{withSocketTimeout}} and {{withConnectionTimeout}} take an int, not a (nullable) Integer. So then maybe something like this would work? - particularly since at the {{HttpClient}} / {{HttpRequest}} / {{RequestConfig}} level, a "-1" set on the {{HttpRequest}}'s {{RequestConfig}} is supposed to mean "use the (client) default" ... 
{code:java} SolrParams clientParams = params(HttpClientUtil.PROP_SO_TIMEOUT, 12345, HttpClientUtil.PROP_CONNECTION_TIMEOUT, 67890); HttpClient httpClient = HttpClientUtil.createClient(clientParams); HttpSolrClient client = new HttpSolrClient.Builder(ANY_BASE_SOLR_URL) .withHttpClient(httpClient) .withSocketTimeout(-1) .withConnectionTimeout(-1) .build(); {code} ...except that if we do *that* we get an IllegalArgumentException... {code:java} // SolrClientBuilder public B withConnectionTimeout(int connectionTimeoutMillis) { if (connectionTimeoutMillis < 0) { throw new IllegalArgumentException("connectionTimeoutMillis must be a non-negative integer."); } {code} This is madness, and eliminates most/all of the known value of using {{.withHttpClient}}
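Given the behavior described above, the only way to get the desired timeouts reliably applied is to restate them on the builder itself; a minimal sketch, reusing the names from the examples above:

{code:java}
// Sketch: builder-level defaults always win over the HttpClient's own
// config, so repeat the intended timeouts on the builder - they are what
// actually gets applied to each HttpRequest.
HttpClient httpClient = HttpClientUtil.createClient(clientParams);
HttpSolrClient solrClient = new HttpSolrClient.Builder(ANY_BASE_SOLR_URL)
    .withHttpClient(httpClient)
    .withSocketTimeout(12345)       // must be restated here to take effect
    .withConnectionTimeout(67890)
    .build();
{code}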
[jira] [Updated] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades
[ https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13599: Attachment: thetaphi_Lucene-Solr-master-Windows_8025.log.txt Status: Open (was: Open) Details of Uwe's jenkins updates... * http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E * http://mail-archives.apache.org/mod_mbox/lucene-dev/201907.mbox/%3C01a901d530a7$fac9d2a0$f05d77e0$@thetaphi.de%3E * http://mail-archives.apache.org/mod_mbox/lucene-dev/201907.mbox/raw/%3C01a901d530a7$fac9d2a0$f05d77e0$@thetaphi.de%3E/4 I'm attaching thetaphi_Lucene-Solr-master-Windows_8025.log.txt as an illustrative example of the failure; here are some key snippets and the associated lines from the test class... {noformat} # Previously: test individual adds, delById, and delByQ using... # ... rf=3 with all replicas connected, # ... rf=2 when one replica's proxy is closed, # ... rf=1 when both replicas' proxies are closed # Lines # 314-320 - "heal" the cluster (re-enable all proxies) ... [junit4] 2> 555732 INFO (TEST-ReplicationFactorTest.test-seed#[C415B4F186C6C69D]) [ ] o.a.s.c.AbstractFullDistribZkTestBase Found 3 replicas and leader on 127.0.0.1:59004_ for shard1 in repfacttest_c8n_1x3 [junit4] 2> 555732 INFO (TEST-ReplicationFactorTest.test-seed#[C415B4F186C6C69D]) [ ] o.a.s.c.AbstractFullDistribZkTestBase Took 7107.0 ms to see all replicas become active. ... # Lines # 322-326 - checks that (individual) add, delById & delByQ all get rf=3 # Lines # 328-341 - checks that (batched) add, delById & delByQ all get rf=3 # Line # 344 - close a proxy port (59108) again ... [junit4] 2> 556060 WARN (TEST-ReplicationFactorTest.test-seed#[C415B4F186C6C69D]) [ ] o.a.s.c.s.c.SocketProxy Closing 1 connections to: http://127.0.0.1:59108/, target: http://127.0.0.1:59109/ {noformat} At this point, the next thing in the test is to add a batch of documents (ids#15-29) while one replica is partitioned -- but I should point out that it's not immediately obvious to me if the {{updateExecutor-1924-thread-4}} logging from the leader below (complaining about {{Connection refused:}} to port 59108) is *because* of the update sent by the client, or independently because of the HTTP2 connection management detecting that the proxy was closed... {noformat} # Lines # 346-355 - send our first "batch" (id#15-29) when cluster isn't "healed" [junit4] 2> 558074 ERROR (updateExecutor-1924-thread-4-processing-x:repfacttest_c8n_1x3_shard1_replica_n2 r:core_node5 null n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1) [n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1 r:core_node5 x:repfacttest_c8n_1x3_shard1_replica_n2 ] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/ to http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/ [junit4] 2> => java.io.IOException: java.net.ConnectException: Connection refused: no further information ... # ...there are more details about suppressed exceptions # ...this ERROR repeats many times - evidently as the leader tries to reconnect... ... 
[junit4] 2> 560193 ERROR (updateExecutor-1924-thread-4-processing-x:repfacttest_c8n_1x3_shard1_replica_n2 r:core_node5 null n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1) [n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1 r:core_node5 x:repfacttest_c8n_1x3_shard1_replica_n2 ] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/ to http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/ [junit4] 2> => java.io.IOException: java.net.ConnectException: Connection refused: no further information ... # ... brief bit of path=/admin/metrics logging from both n:127.0.0.1:59004_ and n:127.0.0.1:59084_ # ... and some other MetricsHistoryHandler logging (from overseer?) about failing to talk to 127.0.0.1:59108 # ... but mostly lots of logging from the leader about not being able to connect to 127.0.0.1:59108 # live replica (port 59060) logs that it's added the 15 docs FROMLEADER, ... BUT... # ... same thread then logs jetty EofException: Reset cancel_stream_error # ... so apparently it added the docs but had a problem communicating that back to the leader # ... evidently because it took 30 seconds (QTime = 30013) and the leader gave up (see below) [junit4] 2> 591364 INFO (qtp1520091886-5884) [n:127.0.0.1:59060_ c:repfacttest_c8n_1x3 s:shard1 r:core_node4 x:repfacttest_c8n_1x3_shard1_replica_n1 ] o.a.s.u.p.LogUpdateProcessorFactory [repfacttest_c8n_1x3_shard1_replica_n1] webapp= path=/update params={update.dist
[jira] [Updated] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades
[ https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13599: Description: We've started seeing some weirdly consistent (but not reliably reproducible) failures from ReplicationFactorTest when running on Uwe's Windows jenkins machines. The failures all seem to have started on June 22 -- when Uwe upgraded the Java version on his Windows VMs -- but happen across all versions of java tested, and on both master and branch_8x. While this test failed a total of 5 times, in different ways, on various jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins builds frequently fails anywhere from 1-4 additional times. All of these failures occur in the exact same place, with the exact same assertion: that the expected replicationFactor of 2 was not achieved, and an rf=1 (ie: only the master) was returned, when sending a _batch_ of documents to a collection with 1 shard, 3 replicas; while 1 of the replicas was partitioned off due to a closed proxy. In the handful of logs I've examined closely, the 2nd "live" replica does in fact log that it received & processed the update, but with a QTime of over 30 seconds, and then it immediately logs an {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- meanwhile, the leader has one {{updateExecutor}} thread logging copious amounts of {{java.net.ConnectException: Connection refused: no further information}} regarding the replica that was partitioned off, before a second {{updateExecutor}} thread ultimately logs {{java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" replica. What makes this perplexing is that this is not the first time in the test that documents were added to this collection while one replica was partitioned off, but it is the first time that all 3 of the following are true _at the same time_: # the collection has recovered after some replicas were partitioned and re-connected # a batch of multiple documents is being added # one replica has been "re" partitioned. ...prior to the point when this failure happens, only individual document adds were tested while replicas were partitioned. Batches of adds were only tested when all 3 replicas were "live" after the proxies were re-opened and the collection had fully recovered. The failure also comes from the first update to happen after a replica's proxy port has been "closed" for the _second_ time. While this confluence of events might conceivably trigger some weird bug, what makes these failures _particularly_ perplexing is that: * the failures only happen on Windows * the failures only started after the Windows VM update on June-22. was: We've started seeing some weirdly consistent (but not reliably reproducible) failures from ReplicationFactorTest when running on Uwe's Windows jenkins machines. The failures all seem to have started on June 22 -- when Uwe upgraded the Java version on his Windows VMs -- but happen across all versions of java tested, and on both master and branch_8x. 
While this test failed a total of 5 times, in different ways, on various jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins builds frequently fails anywhere from 1-4 additional times. All of these failures occur in the exact same place, with the exact same assertion: that the expected replicationFactor of 2 was not achieved, and an rf=1 (ie: only the master) was returned, when sending a _batch_ of documents to a collection with 1 shard, 3 replicas; while 1 of the replicas was partitioned off due to a closed proxy. In the handful of logs I've examined closely, the 2nd "live" replica does in fact log that it received & processed the update, but with a QTime of over 30 seconds, and then it immediately logs an {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- meanwhile, the leader has one {{updateExecutor}} thread logging copious amounts of {{java.net.ConnectException: Connection refused: no further information}} regarding the replica that was partitioned off, before a second {{updateExecutor}} thread ultimately logs {{java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" replica. > ReplicationFactorTest high failure rate on Windows jenkins VMs after > 2019-06-22 OS/java upgrades > > >
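For context, a hedged sketch of how a SolrJ client can observe the achieved replication factor that this test asserts on; the min_rf parameter and the variable names here are illustrative assumptions, not code quoted from ReplicationFactorTest:

{code:java}
// Send a batch with min_rf requested, then read back the rf Solr reports
// having achieved. The failures described above see rf=1 (leader only)
// where the test expects rf=2.
UpdateRequest req = new UpdateRequest();
req.add(batchOfDocs);                  // e.g. the ids #15-29 batch
req.setParam("min_rf", "2");           // ask Solr to report the achieved rf
NamedList<Object> resp = cloudClient.request(req, collectionName);
int achievedRf = cloudClient.getMinAchievedReplicationFactor(collectionName, resp);
assertEquals(2, achievedRf);           // the failing assertion: rf is only 1
{code}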
[jira] [Created] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades
Hoss Man created SOLR-13599: --- Summary: ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades Key: SOLR-13599 URL: https://issues.apache.org/jira/browse/SOLR-13599 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man We've started seeing some weirdly consistent (but not reliably reproducible) failures from ReplicationFactorTest when running on Uwe's Windows jenkins machines. The failures all seem to have started on June 22 -- when Uwe upgraded the Java version on his Windows VMs -- but happen across all versions of java tested, and on both master and branch_8x. While this test failed a total of 5 times, in different ways, on various jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins builds frequently fails anywhere from 1-4 additional times. All of these failures occur in the exact same place, with the exact same assertion: that the expected replicationFactor of 2 was not achieved, and an rf=1 (ie: only the master) was returned, when sending a _batch_ of documents to a collection with 1 shard, 3 replicas; while 1 of the replicas was partitioned off due to a closed proxy. In the handful of logs I've examined closely, the 2nd "live" replica does in fact log that it received & processed the update, but with a QTime of over 30 seconds, and then it immediately logs an {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- meanwhile, the leader has one {{updateExecutor}} thread logging copious amounts of {{java.net.ConnectException: Connection refused: no further information}} regarding the replica that was partitioned off, before a second {{updateExecutor}} thread ultimately logs {{java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" replica.
[jira] [Resolved] (SOLR-12988) Known OpenJDK >= 11 SSL (TLSv1.3) bugs can cause problems with Solr
[ https://issues.apache.org/jira/browse/SOLR-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-12988. - Resolution: Workaround With the jenkins servers upgraded, and the new SSLTestConfig assumptions in place, I haven't seen any (obvious) signs of any other openJDK related SSL bugs in the solr tests ... if more are identified we can update the issue description to list them here. I've also created SOLR-13594 to track the (eventual) need to enable SSL testing on java-13-ea once the known bugs are addressed (but fortunately, the way the suppression logic is implemented, it explicitly checks for "ea" builds ... so even if we never get a chance to proactively test on future java-13-ea builds, once java-13 final comes out, the tests _will_ try SSL on them automatically) > Known OpenJDK >= 11 SSL (TLSv1.3) bugs can cause problems with Solr > --- > > Key: SOLR-12988 > URL: https://issues.apache.org/jira/browse/SOLR-12988 > Project: Solr > Issue Type: Test >Reporter: Hoss Man >Assignee: Cao Manh Dat >Priority: Major > Labels: Java11, Java12, Java13 > Attachments: SOLR-12988.patch, SOLR-12988.patch, SOLR-12988.patch, > SOLR-13413.patch > > > There are several known OpenJDK JVM bugs (beginning with Java11, when TLS v1.3 > support was first added) that are known to affect Solr's SSL support, and > have caused numerous test failures -- notably early "testing" builds of > OpenJDK 11, 12, & 13, as well as the officially released OpenJDK 11, 11.0.1, > and 11.0.2. > From the standpoint of the Solr project, there is very little we can do to > mitigate these bugs; we have taken steps to ensure any code using our > {{SSLTestConfig}} / {{RandomizeSSL}} test-framework classes will be "SKIPed" > with an {{AssumptionViolatedException}} when used on JVMs that are known to > be problematic. > Users who encounter any of the types of failures described below, or > developers who encounter test runs that "SKIP" with a message referring to > this issue ID, are encouraged to upgrade their JVM (or, as a last resort, try > disabling "TLSv1.3" in your JVM security properties). > > Examples of known bugs as they have manifested in Solr tests... 
> * https://bugs.openjdk.java.net/browse/JDK-8212885 > ** "TLS 1.3 resumed session does not retain peer certificate chain" > ** affects users with {{checkPeerNames=true}} in your SSL configuration > ** causes 100% failure rate in Solr's > {{TestMiniSolrCloudClusterSSL.testSslWithCheckPeerName}} > ** can result in exceptions for SolrJ users, or in solr cloud server logs > when making intra-node requests, with a root cause of > "javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated" > ** {noformat} >[junit4] 2> Caused by: javax.net.ssl.SSLPeerUnverifiedException: peer > not authenticated >[junit4] 2> at > java.base/sun.security.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:526) >[junit4] 2> at > org.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:464) >[junit4] 2> at > org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:397) >[junit4] 2> at > org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355) >[junit4] 2> at > org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) >[junit4] 2> at > org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359) >[junit4] 2> at > org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) >[junit4] 2> at > org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) >[junit4] 2> at > org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) >[junit4] 2> at > org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) >[junit4] 2> at > org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) >[junit4] 2> at > org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) >[junit4] 2> at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) >[junit4] 2> at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) >[junit4] 2> at > org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542) > {noformat} > * https://bugs.openjdk.java.net/browse/JDK-8213202 > ** "Possible race condition in TLS 1.3 session resumption" > ** May aff
[jira] [Created] (SOLR-13594) re-enable j13-ea SSL testing once known bugs are fixed
Hoss Man created SOLR-13594: --- Summary: re-enable j13-ea SSL testing once known bugs are fixed Key: SOLR-13594 URL: https://issues.apache.org/jira/browse/SOLR-13594 Project: Solr Issue Type: Task Security Level: Public (Default Security Level. Issues are Public) Reporter: Hoss Man SOLR-12988 tracks several known bugs affecting SSL usage in OpenJDK java-13-ea builds. At the moment, SSLTestConfig explicitly throws AssumptionViolatedException if it looks like tests are being run on _any_ java-13-ea build ... once the known bugs are addressed in OpenJDK, and new java-13-ea builds are released that address those bugs, we should patch SSLTestConfig to test those new EA builds, and, assuming no other obvious bugs are identified: * update the logic in SSLTestConfig to suppress SSL testing only on the java-13-ea build#s known to be problematic. * update the jenkins JVMs to use the new EA builds
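As an illustration of the kind of guard being described (a hypothetical sketch, not SSLTestConfig's actual implementation):

{code:java}
// Skip SSL randomization on java-13-ea builds; because the check looks for
// the "-ea" marker, the eventual java-13 final release will exercise SSL
// automatically without any further code change.
String version = System.getProperty("java.version", "");
if (version.startsWith("13") && version.contains("-ea")) {
  throw new org.junit.AssumptionViolatedException(
      "Skipping SSL testing: known TLSv1.3 bugs in java-13-ea (see SOLR-12988)");
}
{code}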
[jira] [Updated] (SOLR-13580) java 13 changes to locale specific Numeric parsing rules affect ParseNumeric UpdateProcessors when using 'locale' config option - notably affects French
[ https://issues.apache.org/jira/browse/SOLR-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-13580: Description: Per [JDK-8221432|https://bugs.openjdk.java.net/browse/JDK-8221432] Java13 has updated to [CLDR 35.1|http://cldr.unicode.org/] – which controls the definition of language & locale specific formatting characters – in a non-backwards compatible way due to "French" changes in [CLDR 34|http://cldr.unicode.org/index/downloads/cldr-34#TOC-Detailed-Data-Changes] This impacts people who use any of the "ParseNumeric" UpdateProcessors in conjunction with the "locale=fr" or "locale=fr_FR" init param and expect the (pre java13) existing behavior of treating U+00A0 (NO BREAK SPACE) as a "grouping" character (ie: between thousands and millions, between millions and billions, etc...). Starting with java13 the JVM expects U+202F (NARROW NO BREAK SPACE) in its place. Notably: upgrading to jdk13-ea+26 caused failures in Solr's ParsingFieldUpdateProcessorsTest, which initially had hardcoded test data that used U+00A0. ParsingFieldUpdateProcessorsTest has since been updated to account for this discrepancy by modifying the test data used to determine the "expected" character for the current JVM, but there is nothing Solr or the ParseNumeric UpdateProcessors can do to help mitigate this change in behavior for end users who upgrade to java13. Affected users with U+00A0 characters in their incoming SolrInputDocuments will see the ParseNumeric UpdateProcessors (configured with locale=fr...) "skip" these values as unparsable, most likely resulting in a failure to index into a numeric field since the original "String" value will be left as is. Affected users may want to consider updating their configs to include a {{RegexReplaceProcessorFactory}}, configured to strip out all whitespace characters, prior to any ParseNumeric update processors configured to expect French-language numbers. was: Per [JDK-8221432|https://bugs.openjdk.java.net/browse/JDK-8221432] Java13 has updated to [CLDR 35.1|http://cldr.unicode.org/] – which controls the definition of language & locale specific formatting characters – in a non-backwards compatible way due to "French" changes in [CLDR 34|http://cldr.unicode.org/index/downloads/cldr-34#TOC-Detailed-Data-Changes] This impacts people who use any of the "ParseNumeric" UpdateProcessors in conjunction with the "locale=fr" or "locale=fr_FR" init param and expect the (pre java13) existing behavior of treating U+00A0 (NO BREAK SPACE) as a "grouping" character (ie: between thousands and millions, between millions and billions, etc...). Starting with java13 the JVM expects U+202F (NARROW NO BREAK SPACE) in its place. Notably: upgrading to jdk13-ea+26 caused failures in Solr's ParsingFieldUpdateProcessorsTest, which initially had hardcoded test data that used U+00A0. ParsingFieldUpdateProcessorsTest has since been updated to account for this discrepancy by modifying the test data used to determine the "expected" character for the current JVM, but there is nothing Solr or the ParseNumeric UpdateProcessors can do to help mitigate this change in behavior for end users who upgrade to java13. Affected users with U+00A0 characters in their incoming SolrInputDocuments will see the ParseNumeric UpdateProcessors (configured with locale=fr...) "skip" these values as unparsable, most likely resulting in a failure to index into a numeric field since the original "String" value will be left as is. 
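A small self-contained demo (added here for illustration; not from the original issue) that shows the JVM-level change being described:

{code:java}
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Prints the grouping separator the running JVM uses for fr_FR:
// U+00A0 (NO BREAK SPACE) before java13, U+202F (NARROW NO BREAK SPACE)
// from java13 onward (CLDR 35.1).
public class FrGroupingSeparator {
  public static void main(String[] args) {
    char sep = DecimalFormatSymbols.getInstance(Locale.FRANCE).getGroupingSeparator();
    System.out.printf("fr_FR grouping separator: U+%04X%n", (int) sep);
  }
}
{code}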
update description with a possible workaround that just occurred to me > java 13 changes to locale specific Numeric parsing rules affect ParseNumeric > UpdateProcessors when using 'locale' config option - notably affects French > > > Key: SOLR-13580 > URL: https://issues.apache.org/jira/browse/SOLR-13580 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Labels: Java13 > Attachments: SOLR-13580.patch > > > Per [JDK-8221432|https://bugs.openjdk.java.net/browse/JDK-8221432] Java13 has > updated to [CLDR 35.1|http://cldr.unicode.org/] – which controls the > definition of language & locale specific formatting characters – in a > non-backwards compatible way due to "French" changes in [CLDR > 34|http://cldr.unicode.org/index/downloads/cldr-34#TOC-Detailed-Data-Changes] > This impacts people who use any of the "ParseNumeric" UpdateProcessors in > conjunction with the "locale=fr" or "locale=fr_FR" init param and expect the > (pre java13) existing behavior of treating U+00A0 (NO BREAK SPACE) as a > "groupi
[jira] [Resolved] (SOLR-13580) java 13 changes to locale specific Numeric parsing rules affect ParseNumeric UpdateProcessors when using 'locale' config option - notably affects French
[ https://issues.apache.org/jira/browse/SOLR-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-13580. - Resolution: Not A Bug > java 13 changes to locale specific Numeric parsing rules affect ParseNumeric > UpdateProcessors when using 'locale' config option - notably affects French > > > Key: SOLR-13580 > URL: https://issues.apache.org/jira/browse/SOLR-13580 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Hoss Man >Assignee: Hoss Man >Priority: Major > Labels: Java13 > Attachments: SOLR-13580.patch > > > Per [JDK-8221432|https://bugs.openjdk.java.net/browse/JDK-8221432] Java13 has > updated to [CLDR 35.1|http://cldr.unicode.org/] – which controls the > definition of language & locale specific formatting characters – in a > non-backwards compatible way due to "French" changes in [CLDR > 34|http://cldr.unicode.org/index/downloads/cldr-34#TOC-Detailed-Data-Changes] > This impacts people who use any of the "ParseNumeric" UpdateProcessors in > conjunction with the "locale=fr" or "locale=fr_FR" init param and expect the > (pre java13) existing behavior of treating U+00A0 (NO BREAK SPACE) as a > "grouping" character (ie: between thousands and millions, between millions and > billions, etc...). Starting with java13 the JVM expects U+202F (NARROW NO > BREAK SPACE) in its place. > Notably: upgrading to jdk13-ea+26 caused failures in Solr's > ParsingFieldUpdateProcessorsTest, which initially had hardcoded test data > that used U+00A0. ParsingFieldUpdateProcessorsTest has since been updated to > account for this discrepancy by modifying the test data used to determine the > "expected" character for the current JVM, but there is nothing Solr or the > ParseNumeric UpdateProcessors can do to help mitigate this change in behavior > for end users who upgrade to java13. > Affected users with U+00A0 characters in their incoming SolrInputDocuments > will see the ParseNumeric UpdateProcessors (configured with locale=fr...) > "skip" these values as unparsable, most likely resulting in a failure to > index into a numeric field since the original "String" value will be left as > is. > Affected users may want to consider updating their configs to include a > {{RegexReplaceProcessorFactory}}, configured to strip out all whitespace > characters, prior to any ParseNumeric update processors configured to expect > French-language numbers >
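To make the suggested workaround concrete, here is a hedged plain-Java sketch of the strip-whitespace-before-parsing idea (class and method names are illustrative; in Solr itself this would be done with the {{RegexReplaceProcessorFactory}} configuration described above):

{code:java}
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

// Illustrative only: strip every space-like character, explicitly including
// U+00A0 and U+202F (which java's \s does NOT match by default), before
// locale-aware parsing, so the CLDR grouping-separator change no longer matters.
public class FrNumberParser {
  public static Number parseFrench(String raw) throws ParseException {
    String cleaned = raw.replaceAll("[\\s\\u00A0\\u202F]", "");
    DecimalFormat fmt = new DecimalFormat("#,##0.###",
        DecimalFormatSymbols.getInstance(Locale.FRANCE));
    return fmt.parse(cleaned);  // e.g. "1 234 567,89" parses as 1234567.89
  }
}
{code}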