[jira] [Commented] (LUCENE-8989) IndexSearcher Should Handle Rejection of Concurrent Task
[ https://issues.apache.org/jira/browse/LUCENE-8989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939127#comment-16939127 ] ASF subversion and git services commented on LUCENE-8989: - Commit 15db6bfa88952cf0912b3c93d59c0cdc55bf9e2a in lucene-solr's branch refs/heads/master from Atri Sharma [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=15db6bf ] LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution (#899) When executing queries using Executors, we should gracefully handle the case when Executor rejects a task and run the task on the caller thread > IndexSearcher Should Handle Rejection of Concurrent Task > > > Key: LUCENE-8989 > URL: https://issues.apache.org/jira/browse/LUCENE-8989 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Atri Sharma >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > As discussed in [https://github.com/apache/lucene-solr/pull/815,] > IndexSearcher should handle the case when the executor rejects the execution > of a task (unavailability of threads?). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
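The caller-runs fallback described in the commit message can be sketched as follows. This is an illustrative stand-alone example, not the actual Lucene patch; the class and method names here are invented:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.RejectedExecutionException;

public class CallerRunsFallback {

    // Try to hand the task to the executor; if the executor rejects it
    // (queue full, shutting down, ...), run it on the caller thread
    // instead of propagating the RejectedExecutionException.
    static <T> FutureTask<T> submitOrRunLocally(ExecutorService executor, Callable<T> task) {
        FutureTask<T> future = new FutureTask<>(task);
        try {
            executor.execute(future);
        } catch (RejectedExecutionException e) {
            future.run(); // graceful degradation: the caller thread does the work
        }
        return future;
    }

    // Small demo: a shut-down pool rejects everything, yet the task still completes.
    static int demo() {
        try {
            ExecutorService pool = Executors.newFixedThreadPool(1);
            pool.shutdown();
            return submitOrRunLocally(pool, () -> 21 * 2).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The same caller-runs idea is what `ThreadPoolExecutor.CallerRunsPolicy` provides as a rejection handler; catching the exception at the submission site, as above, works with any `Executor`.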
[GitHub] [lucene-solr] atris merged pull request #899: LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution
atris merged pull request #899: LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution URL: https://github.com/apache/lucene-solr/pull/899 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #899: LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution
atris commented on issue #899: LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution URL: https://github.com/apache/lucene-solr/pull/899#issuecomment-535786132 Merging now -- assuming lazy consensus
[GitHub] [lucene-solr] atris commented on issue #884: LUCENE-8980: optimise SegmentTermsEnum.seekExact performance
atris commented on issue #884: LUCENE-8980: optimise SegmentTermsEnum.seekExact performance URL: https://github.com/apache/lucene-solr/pull/884#issuecomment-535782672 +1, I think this is a good change. The numbers look fine. My only concern was that since this is in the hot path of indexing, additional CPU cycles will be spent in performing the check. However, no degradation seems to be reported in your benchmarks.
[GitHub] [lucene-solr] Ethan-Zhang commented on issue #884: LUCENE-8980: optimise SegmentTermsEnum.seekExact performance
Ethan-Zhang commented on issue #884: LUCENE-8980: optimise SegmentTermsEnum.seekExact performance URL: https://github.com/apache/lucene-solr/pull/884#issuecomment-535771947 good work!
[GitHub] [lucene-solr] yonik opened a new pull request #903: SOLR-13399: add SPLITSHARD splitByPrefix docs
yonik opened a new pull request #903: SOLR-13399: add SPLITSHARD splitByPrefix docs URL: https://github.com/apache/lucene-solr/pull/903
[jira] [Updated] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
[ https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated LUCENE-8991: --- Attachment: LUCENE-8991.patch Status: Patch Available (was: Patch Available) I'll do you one better Erick... In the updated patch, {{-da:java.util.HashMap}} is used if and only if this appears to be an openjdk-based JVM with spec version 10 or 11 (same logic as used in the existing {{documentation-lint.supported}} condition) *and* {{tests.asserts.hashmap}} is "false" (or unset) ... so people can override this logic with a single {{-Dtests.asserts.hashmap=true}}. Which means on trunk today: 1) this line fails fairly reliably for me using openjdk 11.0.4... {noformat} ant test -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8 {noformat} ...but with the patch applied it passes reliably for me, indicating that disabling the assertions on HashMap prevented the failure. 2) with the patch applied *and* the override prop added... {noformat} ant test -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8 -Dtests.asserts.hashmap=true {noformat} ...it starts to fail reliably for me again, indicating that the override works > disable java.util.HashMap assertions to avoid spurious failures due to > JDK-8205399 > -- > > Key: LUCENE-8991 > URL: https://issues.apache.org/jira/browse/LUCENE-8991 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. 
Hostetter >Priority: Major > Labels: Java10, Java11 > Attachments: LUCENE-8991.patch, LUCENE-8991.patch > > > An incredibly common class of jenkins failure (at least in Solr tests) stems > from triggering assertion failures in java.util.HashMap -- evidently > triggering bug JDK-8205399, first introduced in java-10, and fixed in > java-12, but has never been backported to any java-10 or java-11 bug fix > release... >https://bugs.openjdk.java.net/browse/JDK-8205399 > SOLR-13653 tracks how this bug can affect Solr users, but I think it would > make sense to disable java.util.HashMap assertions in our build system to reduce the > confusing failures when users/jenkins run tests, since there is nothing we > can do to work around this when testing with java-11 (or java-10 on branch_8x)
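For context, whether a selective {{-da:...}} flag took effect can be checked from inside the JVM. This is a generic illustration (not part of the patch) using {{Class.desiredAssertionStatus()}}; note that a plain {{-ea}} leaves bootstrap/system classes such as {{java.util.HashMap}} disabled anyway, and {{-esa}} would be needed to enable them:

```java
public class HashMapAssertionStatus {

    // Reports whether assertions would be enabled for java.util.HashMap
    // under the flags this JVM was started with. "-da:java.util.HashMap"
    // as proposed in the patch forces this off even when "-esa" is given.
    static boolean hashMapAssertionsEnabled() {
        return java.util.HashMap.class.desiredAssertionStatus();
    }

    public static void main(String[] args) {
        System.out.println("HashMap assertions enabled: " + hashMapAssertionsEnabled());
    }
}
```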
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939010#comment-16939010 ] Mark Miller commented on SOLR-13796: So my custom gradle test run spits out tests that don't seem to match their annotation, and at this point it spits out 100+ entries like the below. All of these are super fast tests that would often easily take 30-40+ seconds in CI (if they were short tests to begin with). {noformat}
org.apache.solr.handler.PingRequestHandlerTest 3s @Slow Found very fast test annotated as slow!
org.apache.solr.core.SolrCoreTest 5s @Slow Found very fast test annotated as slow!
org.apache.solr.highlight.HighlighterTest 2s @Slow Found very fast test annotated as slow!
org.apache.solr.search.stats.TestExactSharedStatsCache 4s @Slow Found very fast test annotated as slow!
org.apache.solr.core.TestDynamicLoadingUrl 3s @Slow Found very fast test annotated as slow!
org.apache.solr.cloud.rule.RulesTest 7s @Slow Found very fast test annotated as slow!
{noformat} > Fix Solr Test Performance > - > > Key: SOLR-13796 > URL: https://issues.apache.org/jira/browse/SOLR-13796 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Major > > I had kind of forgotten, but while working on Starburst I had realized that > almost all of our tests are capable of being very fast and logging 10x less > as a result. When they get this fast, a lot of infrequent random fails become > frequent and things become much easier to debug. I had fixed a lot of issues > to make tests pretty damn fast in the starburst branch, but tons of tests > were still ignored due to the scope of changes going on. > A variety of things have converged that have allowed me to absorb most of > that work and build upon it while also almost finishing it. 
> This will be another huge PR aimed at addressing issues that have our tests > often take dozens of seconds to minutes when they should take mere seconds or > ten. > As part of this issue, I would like to move the focus of non nightly tests > towards being more minimal, consistent and fast. > In exchange, we must put more effort and care into nightly tests. Not > something that happens now, but if we have solid, fast, consistent non > Nightly tests, that should open up some room for Nightly to get some status > boost.
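The check that produced the report quoted above could look something like this in spirit; the threshold and all names here are guesses for illustration, not the actual implementation from the branch:

```java
public class SlowAnnotationAudit {

    // Assumed cutoff below which a suite counts as "very fast";
    // the real tool's threshold is not stated in the comment above.
    static final int FAST_THRESHOLD_SECONDS = 10;

    // Flags a suite that carries @Slow yet finishes well under the cutoff.
    static boolean misannotated(int wallClockSeconds, boolean annotatedSlow) {
        return annotatedSlow && wallClockSeconds < FAST_THRESHOLD_SECONDS;
    }

    public static void main(String[] args) {
        // e.g. "org.apache.solr.handler.PingRequestHandlerTest 3s @Slow"
        if (misannotated(3, true)) {
            System.out.println("Found very fast test annotated as slow!");
        }
    }
}
```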
[jira] [Updated] (SOLR-13797) SolrResourceLoader produces inconsistent results when given bad arguments
[ https://issues.apache.org/jira/browse/SOLR-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Drob updated SOLR-13797: - Status: Patch Available (was: Open) > SolrResourceLoader produces inconsistent results when given bad arguments > - > > Key: SOLR-13797 > URL: https://issues.apache.org/jira/browse/SOLR-13797 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 7.7.2, 8.2 >Reporter: Mike Drob >Assignee: Mike Drob >Priority: Major > Attachments: SOLR-13797.v1.patch > > > SolrResourceLoader will attempt to do some magic to infer what the user > wanted when loading TokenFilter and Tokenizer classes. However, this can end > up putting the wrong class in the cache such that the request succeeds the > first time but fails subsequent times. It should either succeed or fail > consistently on every call. > This can be triggered in a variety of ways, but the simplest is maybe by > specifying the wrong element type in an indexing chain. Consider the field > type definition: > {code:xml} > <fieldType name="..." class="solr.TextField"> > <analyzer> > <filter class="solr.NGramTokenizerFactory" > maxGramSize="2"/> > </analyzer> > </fieldType> > {code} > If loaded by itself (e.g. docker container for standalone validation) then > the schema will pass and collection will succeed, with Solr actually figuring > out that it needs an {{NGramTokenFilterFactory}}. However, if this is loaded > on a cluster with other collections where the {{NGramTokenizerFactory}} has > been loaded correctly then we get {{ClassCastException}}. Or if this > collection is loaded first then others using the Tokenizer will fail instead. > I'd argue that succeeding on both calls is the better approach because it > does what the user likely wants instead of what the user explicitly asks for, > and creates a nicer user experience that is marginally less pedantic. 
[jira] [Updated] (SOLR-13797) SolrResourceLoader produces inconsistent results when given bad arguments
[ https://issues.apache.org/jira/browse/SOLR-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Drob updated SOLR-13797: - Attachment: SOLR-13797.v1.patch Status: Open (was: Open) This is a patch that allows both calls to succeed by removing the bad value from the cache. If somebody has a preference for the strict approach, let me know. > SolrResourceLoader produces inconsistent results when given bad arguments > - > > Key: SOLR-13797 > URL: https://issues.apache.org/jira/browse/SOLR-13797 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 8.2, 7.7.2 >Reporter: Mike Drob >Assignee: Mike Drob >Priority: Major > Attachments: SOLR-13797.v1.patch > > > SolrResourceLoader will attempt to do some magic to infer what the user > wanted when loading TokenFilter and Tokenizer classes. However, this can end > up putting the wrong class in the cache such that the request succeeds the > first time but fails subsequent times. It should either succeed or fail > consistently on every call. > This can be triggered in a variety of ways, but the simplest is maybe by > specifying the wrong element type in an indexing chain. Consider the field > type definition: > {code:xml} > <fieldType name="..." class="solr.TextField"> > <analyzer> > <filter class="solr.NGramTokenizerFactory" > maxGramSize="2"/> > </analyzer> > </fieldType> > {code} > If loaded by itself (e.g. docker container for standalone validation) then > the schema will pass and collection will succeed, with Solr actually figuring > out that it needs an {{NGramTokenFilterFactory}}. However, if this is loaded > on a cluster with other collections where the {{NGramTokenizerFactory}} has > been loaded correctly then we get {{ClassCastException}}. Or if this > collection is loaded first then others using the Tokenizer will fail instead. 
> I'd argue that succeeding on both calls is the better approach because it > does what the user likely wants instead of what the user explicitly asks for, > and creates a nicer user experience that is marginally less pedantic.
[jira] [Created] (SOLR-13797) SolrResourceLoader produces inconsistent results when given bad arguments
Mike Drob created SOLR-13797: Summary: SolrResourceLoader produces inconsistent results when given bad arguments Key: SOLR-13797 URL: https://issues.apache.org/jira/browse/SOLR-13797 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Affects Versions: 8.2, 7.7.2 Reporter: Mike Drob Assignee: Mike Drob SolrResourceLoader will attempt to do some magic to infer what the user wanted when loading TokenFilter and Tokenizer classes. However, this can end up putting the wrong class in the cache such that the request succeeds the first time but fails subsequent times. It should either succeed or fail consistently on every call. This can be triggered in a variety of ways, but the simplest is maybe by specifying the wrong element type in an indexing chain. Consider the field type definition: {code:xml} <fieldType name="..." class="solr.TextField"> <analyzer> <filter class="solr.NGramTokenizerFactory" maxGramSize="2"/> </analyzer> </fieldType> {code} If loaded by itself (e.g. docker container for standalone validation) then the schema will pass and collection will succeed, with Solr actually figuring out that it needs an {{NGramTokenFilterFactory}}. However, if this is loaded on a cluster with other collections where the {{NGramTokenizerFactory}} has been loaded correctly then we get {{ClassCastException}}. Or if this collection is loaded first then others using the Tokenizer will fail instead. I'd argue that succeeding on both calls is the better approach because it does what the user likely wants instead of what the user explicitly asks for, and creates a nicer user experience that is marginally less pedantic.
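A minimal sketch of the cache-eviction idea the patch takes (all class and method names here are invented, not SolrResourceLoader's actual API): if a cached class is not assignable to the type the caller expects, drop the entry and resolve again, so every call behaves the same instead of only the first one succeeding.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ConsistentClassCache {

    private final Map<String, Class<?>> cache = new ConcurrentHashMap<>();

    // Look up a class by name, expecting it to be assignable to `expected`.
    // A stale entry of the wrong type is evicted and re-resolved rather than
    // returned, so the second call succeeds instead of throwing
    // ClassCastException forever after.
    <T> Class<? extends T> lookup(String name, Class<T> expected,
                                  Function<String, Class<?>> resolver) {
        Class<?> c = cache.get(name);
        if (c != null && !expected.isAssignableFrom(c)) {
            cache.remove(name); // the fix: purge the bad value from the cache
            c = null;
        }
        if (c == null) {
            c = resolver.apply(name);
            cache.put(name, c);
        }
        return c.asSubclass(expected);
    }

    // Demo: poison the cache with a wrong type, then look up again.
    static String demo() {
        ConsistentClassCache cc = new ConsistentClassCache();
        cc.cache.put("ngram", Integer.class); // wrong: Integer is not a CharSequence
        return cc.lookup("ngram", CharSequence.class, n -> String.class).getName();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // the bad entry was evicted and replaced
    }
}
```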
[jira] [Commented] (SOLR-13747) 'ant test' should fail on JVM's w/known SSL bugs
[ https://issues.apache.org/jira/browse/SOLR-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938996#comment-16938996 ] ASF subversion and git services commented on SOLR-13747: Commit e979255ca75bc554a75daeda523bb0b60ade39f2 in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e979255 ] SOLR-13747: New TestSSLTestConfig.testFailIfUserRunsTestsWithJVMThatHasKnownSSLBugs() to give people running tests more visibility if/when they use a known-buggy JVM causing most SSL tests to silently SKIP (cherry picked from commit ec9780c8aad7ffbf394d4cbefa772c6ba61650d0) > 'ant test' should fail on JVM's w/known SSL bugs > > > Key: SOLR-13747 > URL: https://issues.apache.org/jira/browse/SOLR-13747 > Project: Solr > Issue Type: Test > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-13747.patch > > > If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has > known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this > -- so the user knows they should upgrade their JVM (rather than relying on > the user to notice that SSL tests were {{SKIP}} ed)
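The shape of such a guard test is simple: compare the running JVM's version against a known-bad list and fail loudly. The sketch below is illustrative only; the version strings and names are placeholders, not the list the actual TestSSLTestConfig check uses:

```java
import java.util.Set;

public class JvmSslBugCheck {

    // Illustrative list only -- these version strings are placeholders,
    // not the real test's knowledge of buggy releases.
    static final Set<String> KNOWN_BAD_SSL_JVMS = Set.of("11.0.1", "11.0.2");

    static boolean isKnownBuggy(String javaVersion) {
        return KNOWN_BAD_SSL_JVMS.contains(javaVersion);
    }

    public static void main(String[] args) {
        String v = System.getProperty("java.version");
        if (isKnownBuggy(v)) {
            // fail loudly instead of letting SSL tests silently SKIP
            throw new AssertionError("JVM " + v + " has known SSL bugs; please upgrade");
        }
        System.out.println("JVM " + v + " is not in the known-bad list");
    }
}
```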
[jira] [Commented] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
[ https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938994#comment-16938994 ] Lucene/Solr QA commented on LUCENE-8991: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || || || || || {color:brown} Prechecks {color} || | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 0s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green} 0m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green} 0m 0s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:black}{color} | {color:black} {color} | {color:black} 0m 53s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | LUCENE-8991 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12981459/LUCENE-8991.patch | | Optional Tests | compile javac unit ratsources validatesourcepatterns | | uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | ant | | Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh | | git revision | master / ec9780c8aad | | ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 | | Default Java | LTS | | Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/207/testReport/ | | modules | C: lucene U: lucene | | Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/207/console | | Powered by | Apache Yetus 0.7.0 http://yetus.apache.org | This message was automatically generated. > disable java.util.HashMap assertions to avoid spurious failures due to > JDK-8205399 > -- > > Key: LUCENE-8991 > URL: https://issues.apache.org/jira/browse/LUCENE-8991 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Priority: Major > Labels: Java10, Java11 > Attachments: LUCENE-8991.patch > > > An incredibly common class of jenkins failure (at least in Solr tests) stems > from triggering assertion failures in java.util.HashMap -- evidently > triggering bug JDK-8205399, first introduced in java-10, and fixed in > java-12, but has never been backported to any java-10 or java-11 bug fix > release... >https://bugs.openjdk.java.net/browse/JDK-8205399 > SOLR-13653 tracks how this bug can affect Solr users, but I think it would > make sense to disable java.util.HashMap assertions in our build system to reduce the > confusing failures when users/jenkins run tests, since there is nothing we > can do to work around this when testing with java-11 (or java-10 on branch_8x)
[jira] [Commented] (SOLR-13661) A package management system for Solr
[ https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938991#comment-16938991 ] Noble Paul commented on SOLR-13661: --- bq.Simplicity should be front seat. Don't force users to have to add {{package="my-pkg"}} A name is something everyone needs. A package name is an extremely important aspect. You cannot load or reload plugins from a package if we do not know the name of the package. Just a class name means nothing. Isolated classloaders are extremely important. Every sensible platform is built with isolation. We can possibly later add a global feature called {{config-package=mypkg}}. This would mean that every plugin will load from {{mypkg}}. There is no reason why it cannot be added later. But, not being able to load plugins from multiple packages is a strict NO. You are just ensuring that multiple packages from multiple plugin writers cannot coexist. Another issue is backward incompatibility. The new class loader design is different and it can break current deployments. This new design totally disallows per core classloaders. This can be a problem for users who use Solr with core level libs. So, it would be backward incompatible as well. bq.Robustness during upgrades is another concern. I don't see mentioned in the design doc what happens during a Solr upgrade. I'm not sure you have read the design properly. Robustness of design is the paramount feature of this design. We went out of our way to ensure that "update" is non-disruptive. Rolling restarts are NEVER REQUIRED. Solr behaves very badly during rolling restarts; we lose shards/replicas or our overseer gets clogged with messages and needs manual intervention. Overall feedback on your feedback [~janhoy] I'm happy to receive and address feedback. We need to divide the following parts # Enhancements which can be implemented with the current design . 
But the current design is meeting the minimum viable product and does not obstruct or hamper implementing an enhancement # The current design is suboptimal or bad UX # Comments on "inadequate eyes" etc. We are an OSS project and we don't work in the same org. So collaboration is always "inadequate". We should always work towards having 100% collaboration in every single feature that we build. The fact that we are discussing this here means that we can have collaboration as long as people are interested in it. So, please refrain from giving this "feedback". We should keep the comments short and sweet and make them "actionable". I'll give examples of #1 and #2 h3. using a Package.java to load plugins This falls into #1 Is this a good suggestion? I would say it's questionable. But, I won't delve into that here. But, can we implement it if required in the future? Yes, of course. Does it have to be discussed in this ticket? I would say it's out of the scope. h3. Changing the blob names to hash-filename This falls into #2 This is a usability issue. I had thought about this while building it; ab had raised the concern in the design doc. It is probably something we should address before committing this. If you have gone through the comments you would have seen it. Let's keep the feedback coming. But let's not distract from the current task at hand. We don't have a working useful package management system today and a minimum viable product is what we need now. Bells and whistles can be added later. We need the wheels, steering, transmission and seats of the car to be built first. Yes, we can add seat belts, air conditioning etc. later. > A package management system for Solr > > > Key: SOLR-13661 > URL: https://issues.apache.org/jira/browse/SOLR-13661 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. 
Issues are Public) >Reporter: Noble Paul >Assignee: Ishan Chattopadhyaya >Priority: Major > Labels: package > Attachments: plugin-usage.png, repos.png > > Time Spent: 20m > Remaining Estimate: 0h > > Here's the design doc: > https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?usp=sharing
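The per-package classloader isolation argued for above can be sketched roughly as follows; this is a hypothetical illustration (names invented, not Solr's actual PackageLoader API). Each package gets its own loader, so two packages can ship conflicting versions of the same class without clashing:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PackageLoaders {

    // One classloader per package name, each delegating to the container's
    // loader for shared classes. Distinct loaders mean classes they define
    // are distinct even when the fully qualified names are equal.
    private final Map<String, URLClassLoader> loaders = new ConcurrentHashMap<>();

    ClassLoader loaderFor(String packageName, URL[] packageJars) {
        return loaders.computeIfAbsent(packageName,
                name -> new URLClassLoader(packageJars, PackageLoaders.class.getClassLoader()));
    }

    static boolean demoIsolated() {
        PackageLoaders pl = new PackageLoaders();
        ClassLoader a = pl.loaderFor("pkg-a", new URL[0]);
        ClassLoader b = pl.loaderFor("pkg-b", new URL[0]);
        return a != b; // separate packages, separate loaders
    }

    public static void main(String[] args) {
        System.out.println("isolated: " + demoIsolated());
    }
}
```

Reloading a package would then amount to closing its loader and building a fresh one, which is what makes "update" possible without restarting the node.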
[jira] [Comment Edited] (SOLR-13661) A package management system for Solr
[ https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938982#comment-16938982 ] Jan Høydahl edited comment on SOLR-13661 at 9/26/19 9:35 PM: - Thanks for all the careful responses Ishan, much of what you write makes a lot of sense. We disagree on others. You failed to address my point 5 on version dependency conflicts when (rolling) upgrading Solr. {quote}We have a disagreement here. Needing the user to edit solrconfig.xml necessarily in order to specify which collection uses which package is bad from simplicity standpoint. Please keep in mind that hand editing configset is an expert feature. No, we're not forcing regular users to do anything extra. Regular users should be using config APIs to register/deregister their plugins. Expert users are expert enough to just add an extra package name while they are hand editing their solrconfig.xml. I think this is a very reasonable compromise. {quote} My concern is partly ease of use, but also keeping a config set coherent. The dependency information that a certain collection will not work without e.g. Kuromoji analyzer, ICU4J and BMax Query parser belongs there with the config set, so that we can alert user when creating collection if a dependency is not met. Thus support for foo tags (or a variant of lib tag as David suggested) in solrconfig makes perfect sense. Collections are sometimes moved/copied between clusters. You could install a new cluster fresh, restore from backup etc. I'm not against a bin/solr package deploy command, but that command should then add the tag to the collection(s), not to some global config. 13. I thought of another case, namely backup/restore. I suppose that all package data from ZK will be backed up and restored. But will blobs be part of the backup? If not, how would a restore make sure that the blob store is re-populated? What about CDCR? Will it sync packages and blobs? What if the other cluster has another Solr version? 
:) I'd also vote for targeting 9.0 instead of 8.x. It would make for a great killer feature. To get the feature out in people's hands we could do a 9.0.0-beta release this fall, while continuing to release 8.4, 8.5 etc. afterwards. Lots of people would start using the beta release, including 3rd party plugin developers, and we'd get tons of feedback and time to adjust and stabilize the APIs. The 9.0.0-beta release could also be a public beta for the Gradle build, which is another thing users (and not least integrators) need to adjust to. > A package management system for Solr >
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938976#comment-16938976 ] Mark Miller commented on SOLR-13796: Here is a sample of the knobs we have currently (I'm sure there are others) - we want to get tight control of these so we can have fast, consistent runs, as well as runs not meant to be developer interactive (Nightly).
{noformat}
System.setProperty("solr.tests.IntegerFieldType", "solr.IntPointField");
System.setProperty("solr.tests.FloatFieldType", "solr.FloatPointField");
System.setProperty("solr.tests.LongFieldType", "solr.LongPointField");
System.setProperty("solr.tests.DoubleFieldType", "solr.DoublePointField");
System.setProperty("solr.tests.DateFieldType", "solr.DatePointField");
System.setProperty("solr.tests.EnumFieldType", "solr.EnumFieldType");
System.setProperty("solr.tests.numeric.dv", "true");
System.setProperty("solr.tests.numeric.points", "true");
System.setProperty("solr.tests.numeric.points.dv", "true");
System.setProperty("solr.iterativeMergeExecIdleTime", "1000");
System.setProperty("zookeeper.forceSync", "false");
System.setProperty("solr.zkclienttimeout", "9");
System.setProperty("solr.httpclient.retries", "1");
System.setProperty("solr.retries.on.forward", "1");
System.setProperty("solr.retries.to.followers", "1");
System.setProperty("solr.v2RealPath", "true");
System.setProperty("zookeeper.forceSync", "no");
System.setProperty("jetty.testMode", "true");
System.setProperty("enable.update.log", usually() ? "true" : "false");
System.setProperty("tests.shardhandler.randomSeed", Long.toString(random().nextLong()));
System.setProperty("solr.clustering.enabled", "false");
System.setProperty("solr.peerSync.useRangeVersions", String.valueOf(random().nextBoolean()));
System.setProperty("solr.cloud.wait-for-updates-with-stale-state-pause", "500");
System.setProperty(ZK_WHITELIST_PROPERTY, "*");
DirectUpdateHandler2.commitOnClose = false; // other tests turn this off and try to reset it - we use sys prop below to override
System.setProperty("tests.disableHdfs", "true");
System.setProperty("solr.maxContainerThreads", "20");
System.setProperty("solr.lowContainerThreadsThreshold", "-1");
System.setProperty("solr.minContainerThreads", "0");
System.setProperty("solr.containerThreadsIdle", "3");
System.setProperty("evictIdleConnections", "2");
System.setProperty("solr.commitOnClose", "false"); // can make things quite slow
System.setProperty("solr.codec", "solr.SchemaCodecFactory");
System.setProperty("tests.COMPRESSION_MODE", "BEST_COMPRESSION");
System.setProperty("tests.skipSetupCodec", "true");
System.setProperty("solr.lock.type", "single");
System.setProperty("solr.tests.lockType", "single");
System.setProperty("solr.tests.mergePolicyFactory", "org.apache.solr.index.NoMergePolicyFactory");
System.setProperty("solr.tests.mergeScheduler", "org.apache.lucene.index.ConcurrentMergeScheduler");
System.setProperty("solr.mscheduler", "org.apache.lucene.index.ConcurrentMergeScheduler");
System.setProperty("bucketVersionLockTimeoutMs", "8000");
System.setProperty("socketTimeout", "3");
System.setProperty("connTimeout", "1");
System.setProperty("solr.cloud.wait-for-updates-with-stale-state-pause", "0");
System.setProperty("solr.cloud.starting-recovery-delay-milli-seconds", "0");
System.setProperty("lucene.cms.override_core_count", "2");
System.setProperty("lucene.cms.override_spins", "false");
System.setProperty("solr.tests.maxBufferedDocs", "100");
System.setProperty("solr.tests.ramBufferSizeMB", "20");
System.setProperty("solr.tests.ramPerThreadHardLimitMB", "4");
System.setProperty("managed.schema.mutable", "false");
System.setProperty("solr.disableJvmMetrics", "true");
System.setProperty("useCompoundFile", "false");
System.setProperty("prepRecoveryReadTimeoutExtraWait", "2000");
System.setProperty("evictIdleConnections", "3");
System.setProperty("validateAfterInactivity", "-1");
System.setProperty("leaderVoteWait", "1000");
System.setProperty("leaderConflictResolveWait", "1000");
System.setProperty("solr.recovery.recoveryThrottle", "1000");
System.setProperty("solr.recovery.leaderThrottle", "500");
System.setProperty("solr.cloud.wait-for-updates-with-stale-state-pause", "0");
System.setProperty("solr.httpclient.retries", "1");
System.setProperty("solr.retries.on.forward", "1");
System.setProperty("solr.retries.to.followers", "1");
useFactory("solr.RAMDirectoryFactory");
{noformat}
> Fix Solr Test Performance > - > > Key: SOLR-13796 > URL: https://issues.apache.org/jira/browse/SOLR-13796 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Major > > I had kind of forgotten, but while working on Starburst I had realized that
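Mark's point about consolidating these into a common base class could look something like the following sketch. The class and method names here (SolrTestDefaults, applyNonNightlyDefaults, restore) are hypothetical, not the actual Solr test framework; the idea is simply to record each property's prior value when applying defaults, then restore everything on teardown so tests don't leak settings into each other.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a test base-class helper that owns all System
// property defaults in one place and can undo them afterwards.
public class SolrTestDefaults {
    // Remembers the value each key had before we touched it (null = unset).
    private final Map<String, String> previous = new HashMap<>();

    public void set(String key, String value) {
        previous.put(key, System.getProperty(key));
        System.setProperty(key, value);
    }

    public void applyNonNightlyDefaults() {
        // Fast, consistent settings for non-Nightly runs; a Nightly variant
        // could randomize these more thoroughly instead.
        set("solr.httpclient.retries", "1");
        set("solr.retries.on.forward", "1");
        set("leaderVoteWait", "1000");
    }

    public void restore() {
        for (Map.Entry<String, String> e : previous.entrySet()) {
            if (e.getValue() == null) {
                System.clearProperty(e.getKey());
            } else {
                System.setProperty(e.getKey(), e.getValue());
            }
        }
        previous.clear();
    }
}
```

Centralizing the knobs this way also answers the "you cannot tell what is set now" complaint: there is exactly one place to read.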
[jira] [Commented] (SOLR-13747) 'ant test' should fail on JVM's w/known SSL bugs
[ https://issues.apache.org/jira/browse/SOLR-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938974#comment-16938974 ] ASF subversion and git services commented on SOLR-13747: Commit ec9780c8aad7ffbf394d4cbefa772c6ba61650d0 in lucene-solr's branch refs/heads/master from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ec9780c ] SOLR-13747: New TestSSLTestConfig.testFailIfUserRunsTestsWithJVMThatHasKnownSSLBugs() to give people running tests more visibility if/when they use a known-buggy JVM causing most SSL tests to silently SKIP > 'ant test' should fail on JVM's w/known SSL bugs > > > Key: SOLR-13747 > URL: https://issues.apache.org/jira/browse/SOLR-13747 > Project: Solr > Issue Type: Test > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-13747.patch > > > If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has > known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this > -- so the user knows they should upgrade their JVM (rather than relying on > the user to notice that SSL tests were {{SKIP}}ed) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938973#comment-16938973 ] Mark Miller commented on SOLR-13796: Another thing I have to wrap up, but which is pretty well covered, is collecting all of our system properties into the common base class. You cannot tell what is set now or what can be set; stuff is littered everywhere. I'd like to try and consolidate everything important in the base class, setting them all efficiently for non Nightly runs and more randomly and thoroughly for Nightly runs. > Fix Solr Test Performance > - > > Key: SOLR-13796 > URL: https://issues.apache.org/jira/browse/SOLR-13796 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Major > > I had kind of forgotten, but while working on Starburst I had realized that > almost all of our tests are capable of being very fast and logging 10x less > as a result. When they get this fast, a lot of infrequent random fails become > frequent and things become much easier to debug. I had fixed a lot of issues > to make tests pretty damn fast in the starburst branch, but tons of tests > were still ignored due to the scope of changes going on. > A variety of things have converged that have allowed me to absorb most of > that work and build up on it while also almost finishing it. > This will be another huge PR aimed at addressing issues that have our tests > often take dozens of seconds to minutes when they should take mere seconds or > 10. > As part of this issue, I would like to move the focus of non nightly tests > towards being more minimal, consistent and fast. > In exchange, we must put more effort and care into nightly tests. Not > something that happens now, but if we have solid, fast, consistent non > Nightly tests, that should open up some room for Nightly to get some status > boost.
[jira] [Commented] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
[ https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938955#comment-16938955 ] Erick Erickson commented on LUCENE-8991: Anything that improves test stability is welcome, of course. May I ask that you add the fact that it's been fixed in Java 12 to the comment in the code? Just so 3 years from now, when someone looks at it after we go to JDK12+ as a minimum requirement, it's obvious that it can be removed. > disable java.util.HashMap assertions to avoid spurious failures due to > JDK-8205399 > -- > > Key: LUCENE-8991 > URL: https://issues.apache.org/jira/browse/LUCENE-8991 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Priority: Major > Labels: Java10, Java11 > Attachments: LUCENE-8991.patch > > > An incredibly common class of jenkins failure (at least in Solr tests) stems > from triggering assertion failures in java.util.HashMap -- evidently > triggering bug JDK-8205399, first introduced in java-10, and fixed in > java-12, but has never been backported to any java-10 or java-11 bug fix > release... >https://bugs.openjdk.java.net/browse/JDK-8205399 > SOLR-13653 tracks how this bug can affect Solr users, but I think it would > make sense to disable java.util.HashMap assertions in our build system to reduce the > confusing failures when users/jenkins run tests, since there is nothing we > can do to work around this when testing with java-11 (or java-10 on branch_8x) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
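For context on the workaround being discussed, here is a small illustrative sketch (not the actual LUCENE-8991 patch). `Class.desiredAssertionStatus()` reports whether the JVM's `-ea`/`-da` settings would enable asserts for a given class, and a per-class `-da:` flag can disable assertions for java.util.HashMap alone while leaving `-ea` on everywhere else.

```java
// Illustrative sketch: JDK-8205399 only manifests when java.util.HashMap's
// internal asserts actually run, so a build can pass the flag below to the
// test JVM and keep -ea for all other classes.
public class HashMapAssertStatus {
    // Reflects the -ea/-da settings this JVM would apply to the class.
    public static boolean hashMapAssertsEnabled() {
        return java.util.HashMap.class.desiredAssertionStatus();
    }

    // Builds the JVM argument that disables assertions for one class only,
    // e.g. "-da:java.util.HashMap".
    public static String disableFlagFor(Class<?> c) {
        return "-da:" + c.getName();
    }
}
```

Adding the resulting `-da:java.util.HashMap` to the test JVM args is exactly the kind of build-system change the patch proposes, and the "fixed in Java 12" note Erick asks for belongs next to it.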
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938948#comment-16938948 ] Mark Miller commented on SOLR-13796: So closing things has been a problem for a few reasons, especially when we have a lot of items to close.
* Often the logic of what exceptions to ignore and ensuring proper flow through a close is complicated and buggy.
* Any object in our complicated graph taking a long time to close can greatly affect other components and really slow down tests.
* Having to wait for slow closes hides bugs and problems and monsters.
* Often closes are implemented inefficiently, allowing greater time for ugly slow interactions to kick off while the system is in a partially closed state.
In trying to solve this issue I've created a new SmartClose class; you use it like:
{noformat}
try (SmartClose closer = new SmartClose(this)) {
  closer.add(object1);
  closer.add("Group1", object2, executor1);
  closer.add("Group2", object3, () -> {
    weirdObject.shutdown();
    return weirdObject;
  });
}
{noformat}
When the SmartClose object closes, it will close the members of each 'add' work group in parallel, but each add is handled in order. So you don't have to handle much null logic (pass things straight to add, null or not), and you don't have to handle exception logic or efficiency logic. You just specify what has to be closed or shut down and what can be done in parallel or has to be done in order (each add call), and the rest is handled for you.
You also get nice automatic tracking so you know how long closes are taking and what parts of them are slow, e.g.:
{noformat}
org.apache.solr.core.CoreContainer 155ms
: ZkContainer 44ms
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor 0ms
org.apache.solr.cloud.ZkController 0ms
: workExecutor & replayUpdateExec 1ms
org.apache.solr.common.util.OrderedExecutor 0ms
com.codahale.metrics.InstrumentedExecutorService 0ms
: MetricsHistory & WaitForSolrCores 93ms
org.apache.solr.core.SolrCores 93ms
org.apache.solr.handler.admin.MetricsHistoryHandler 0ms
org.apache.solr.client.solrj.impl.CloudSolrClient 0ms
: Metrics reporters & guages 11ms
org.apache.solr.metrics.SolrMetricManager:REP:NODE 11ms
org.apache.solr.metrics.SolrMetricManager:REP:JVM 2ms
org.apache.solr.metrics.SolrMetricManager:REP:JETTY 3ms
org.apache.solr.metrics.SolrMetricManager:GA:JVM 0ms
org.apache.solr.metrics.SolrMetricManager:GA:NODE 3ms
org.apache.solr.metrics.SolrMetricManager:GA:JETTY 0ms
: Final Items 2ms
org.apache.solr.handler.component.HttpShardHandlerFactory 0ms
org.apache.solr.update.UpdateShardHandler 0ms
org.apache.solr.core.SolrResourceLoader 0ms
org.apache.solr.metrics.SolrMetricManager:REP:CLUSTER 0ms
org.apache.solr.handler.admin.CoreAdminHandler 0ms
{noformat}
> Fix Solr Test Performance > - > > Key: SOLR-13796 > URL: https://issues.apache.org/jira/browse/SOLR-13796 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Major > > I had kind of forgotten, but while working on Starburst I had realized that > almost all of our tests are capable of being very fast and logging 10x less > as a result. When they get this fast, a lot of infrequent random fails become > frequent and things become much easier to debug. I had fixed a lot of issues > to make tests pretty damn fast in the starburst branch, but tons of tests > were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of > that work and build up on it while also almost finishing it. > This will be another huge PR aimed at addressing issues that have our tests > often take dozens of seconds to minutes when they should take mere seconds or > 10. > As part of this issue, I would like to move the focus of non nightly tests > towards being more minimal, consistent and fast. > In exchange, we must put more effort and care into nightly tests. Not > something that happens now, but if we have solid, fast, consistent non > Nightly tests, that should open up some room for Nightly to get some status > boost. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
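The close-ordering behavior described above can be sketched as a small standalone class. This `GroupCloser` is illustrative only (SmartClose is the proposed class and its real API differs): each `add()` call forms a group, groups are closed in order, and the members of a group are closed in parallel, with nulls tolerated and exceptions swallowed rather than propagated mid-close.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of the SmartClose idea: ordered groups, parallel
// members within a group, forgiving of nulls and exceptions.
public class GroupCloser implements AutoCloseable {
    private final List<List<AutoCloseable>> groups = new ArrayList<>();

    public void add(AutoCloseable... objects) {
        List<AutoCloseable> group = new ArrayList<>();
        for (AutoCloseable o : objects) {
            if (o != null) group.add(o); // nulls are simply skipped
        }
        groups.add(group);
    }

    @Override
    public void close() {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (List<AutoCloseable> group : groups) {
                // Submit every member of this group, then wait for all of
                // them before moving to the next group.
                List<Future<?>> futures = new ArrayList<>();
                for (AutoCloseable o : group) {
                    futures.add(pool.submit(() -> {
                        try { o.close(); } catch (Exception e) { /* collect, don't propagate */ }
                        return null;
                    }));
                }
                for (Future<?> f : futures) {
                    try { f.get(); } catch (Exception ignored) { }
                }
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

A real implementation would also record per-object timings (as in the tracking output Mark shows) and collect suppressed exceptions instead of discarding them.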
[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938935#comment-16938935 ] Michael Sokolov commented on LUCENE-8920: - > Here is a proposal for the heuristic to select the encoding of a FST node. I like the overall structure of the proposal. I'm unsure about the proposed levels. For example, I believe the current FST does not have D1=0.66; rather, it is 0.25. I'm not saying that's the _right_ choice, merely that I think it is what we have on master, and the discrepancy here makes me wonder if I'm reading the proposal correctly. How did you come up with the 0.66 number? > Reduce size of FSTs due to use of direct-addressing encoding > - > > Key: LUCENE-8920 > URL: https://issues.apache.org/jira/browse/LUCENE-8920 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael Sokolov >Priority: Blocker > Fix For: 8.3 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Some data can lead to worst-case ~4x RAM usage due to this optimization. > Several ideas were suggested to combat this on the mailing list: > bq. I think we can improve the situation here by tracking, per-FST instance, > the size increase we're seeing while building (or perhaps do a preliminary > pass before building) in order to decide whether to apply the encoding. > bq. we could also make the encoding a bit more efficient. For instance I > noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) > which make gaps very costly. Associating each label with a dense id and > having an intermediate lookup, ie. lookup label -> id and then id->arc offset > instead of doing label->arc directly could save a lot of space in some cases? > Also it seems that we are repeating the label in the arc metadata when > array-with-gaps is used, even though it shouldn't be necessary since the > label is implicit from the address?
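The density threshold under discussion can be illustrated with a tiny sketch (hypothetical names, not Lucene's actual FST builder code): a node would use direct addressing only when its arcs fill at least a minimum fraction of its label range; 0.25 is the value Michael says is on master, while the proposal floats 0.66.

```java
// Illustrative sketch of a density heuristic for choosing the arc encoding
// of an FST node. Dense nodes waste little space on gaps; sparse nodes can
// blow up RAM by ~4x with direct addressing, per the issue description.
public class FstEncodingChoice {
    public static boolean useDirectAddressing(int numArcs, int labelRange, double minDensity) {
        if (labelRange <= 0) return false;
        // Fraction of slots in [minLabel, maxLabel] actually occupied by arcs.
        double density = (double) numArcs / labelRange;
        return density >= minDensity;
    }
}
```

Under a 0.25 threshold a node with 10 arcs spread over a label range of 20 (density 0.5) would be direct-addressed, while one with 2 arcs over the same range (density 0.1) would fall back to the gap-free encoding.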
[jira] [Commented] (SOLR-13661) A package management system for Solr
[ https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938881#comment-16938881 ] Ishan Chattopadhyaya commented on SOLR-13661: - bq. Let's understand that this is not our hobby. It's a job. We are all able to do this because somebody is funding the development. When somebody is funding the development, they will have certain requirements and all their requirements need to be met. I'm sure every org works like that. Salesforce, Bloomberg, Apple, Lucidworks and all the big contributors are building big features to satisfy their businesses. I disagree with the spirit of this comment. I'd have done this work irrespective of someone funding this development. For me, it is as much a hobby as a job. > A package management system for Solr > > > Key: SOLR-13661 > URL: https://issues.apache.org/jira/browse/SOLR-13661 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Noble Paul >Assignee: Ishan Chattopadhyaya >Priority: Major > Labels: package > Attachments: plugin-usage.png, repos.png > > Time Spent: 20m > Remaining Estimate: 0h > > Here's the design doc: > https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-13661) A package management system for Solr
[ https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938863#comment-16938863 ] Ishan Chattopadhyaya edited comment on SOLR-13661 at 9/26/19 6:55 PM: -- 1. {quote}too many decisions seem to be made with too few eyes {quote} Noble, Ishan, Jan, Andrzej, David, Erick. We've had at least these many "eyes". How many more or who else are needed? 2. {quote}"package" concept seems to be designed for ONE use case only, customer's internal custom packages, with arbitrary local naming of repos and packages {quote} Not at all. Apache's contrib modules can easily be installed/deployed the same way, they can be packages themselves. End of the day, a "package" is just a jar and some metadata. A repository is just a named location containing packages. There is no "arbitrary local naming of repos and packages", those aspects are left to the plugin writer (which can be Apache as well in case of official contrib modules). {quote}before such a feature goes mainstream, the design should also include converting some of our contrib modules to packages that we release as separate binaries in the mirrors, and enable an "apache" Repo as default. {quote} Converting a contrib module needn't be a precursor to releasing this feature. Moreover, without the Gradle work completed, I don't want to attempt to change the build system too much (only to do it again later). The system design supports us doing so, and it can be taken up later. {quote}Perhaps that would mean some name spacing or name collision resolution {quote} Namespacing the packages is something we thought of. We already have two pieces of information available in public APIs and all system boundaries/interfaces: "repository" and "package-name". If we need to internally namespace the packages, we can do so later. However, I don't think we need to do this: look at dnf, apt etc. systems, and there's no concept of namespaces of packages. 
That means, third parties should be careful enough not to name their packages the same as official packages. 3. Sure. {quote}We need a plan for how 3rd party plugin developers can publish their plugins on their own web site or on GitHub in a well defined way. {quote} I can add a document to that effect in our ref-guide. Initially, just the repository structure documented in the design document will be supported. GitHub support can be added subsequently. 4. {quote}Hot/Cold deploy. I don't like systems where you, as part of the install need to spin up a server. {quote} Spinning up a cluster prior to installing the packages is not "needed". Someone can cold start a cluster with plugins pre-installed. Noble and I have both documented those steps in the design doc as a reply to your comment. 6. That's a tradeoff. I initially raised the same point, but Noble suggested that adding a new set of znode watchers per collection is an overhead. I'm +0 on your suggestion. 7. We have a disagreement here. Requiring the user to edit solrconfig.xml in order to specify which collection uses which package is bad from a simplicity standpoint. Please keep in mind that hand editing a configset is an expert feature. 8. No, we're not forcing regular users to do anything extra. Regular users should be using config APIs to register/deregister their plugins. Expert users are expert enough to just add an extra package name while they are hand editing their solrconfig.xml. I think this is a very reasonable compromise. 9. The plan is to support both: (a) jar without manifest + external manifest, (b) jar containing manifest. The first scenario should be preferred in case of multi-jar packages. 10. Sure, I like the idea of supporting multiple jars as well as a zip containing all the jars. It can be supported. 11. Plugin initialisation commands are not complex at all; they are typically just regular config API commands to register the plugins. They are necessary for pleasant adoption.
A user just needs to install a package and specify which collection he needs to deploy his package to. Simple! If we go by your idea, then the user would install the package but would need to hand-edit the solrconfig.xml in order to add the plugins from the package to his collection. 12. I agree. We thought a lot about how to support filenames, and all the approaches seemed to have some deficiency. I've documented that as a comment in the design document. Without an additional distributed KV datastore, this seems hard to get right. The sha256.properties file approach will not work well, since every node will have its own version of the file, and also maintaining update consistency will be hard. {quote}But I put a lot of careful thought into the POC which I feel is largely lacking here {quote} I assure you that we've put a lot of thought into how we designed this. We went through deployment lifecycles of at
[jira] [Commented] (SOLR-13105) A visual guide to Solr Math Expressions and Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-13105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938877#comment-16938877 ] ASF subversion and git services commented on SOLR-13105: Commit 17b2308a17532202b62fee4234f4ed05703870e8 in lucene-solr's branch refs/heads/SOLR-13105-visual from Joel Bernstein [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=17b2308 ] SOLR-13105: Update dsp docs 3 > A visual guide to Solr Math Expressions and Streaming Expressions > - > > Key: SOLR-13105 > URL: https://issues.apache.org/jira/browse/SOLR-13105 > Project: Solr > Issue Type: New Feature >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Major > Attachments: Screen Shot 2019-01-14 at 10.56.32 AM.png, Screen Shot > 2019-02-21 at 2.14.43 PM.png, Screen Shot 2019-03-03 at 2.28.35 PM.png, > Screen Shot 2019-03-04 at 7.47.57 PM.png, Screen Shot 2019-03-13 at 10.47.47 > AM.png, Screen Shot 2019-03-30 at 6.17.04 PM.png > > > Visualization is now a fundamental element of Solr Streaming Expressions and > Math Expressions. This ticket will create a visual guide to Solr Math > Expressions and Solr Streaming Expressions that includes *Apache Zeppelin* > visualization examples. > It will also cover using the JDBC expression to *analyze* and *visualize* > results from any JDBC compliant data source. > Intro from the guide: > {code:java} > Streaming Expressions exposes the capabilities of Solr Cloud as composable > functions. These functions provide a system for searching, transforming, > analyzing and visualizing data stored in Solr Cloud collections. > At a high level there are four main capabilities that will be explored in the > documentation: > * Searching, sampling and aggregating results from Solr. > * Transforming result sets after they are retrieved from Solr. > * Analyzing and modeling result sets using probability and statistics and > machine learning libraries. 
> * Visualizing result sets, aggregations and statistical models of the data. > {code} > > A few sample visualizations are attached to the ticket. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13105) A visual guide to Solr Math Expressions and Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-13105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938876#comment-16938876 ] ASF subversion and git services commented on SOLR-13105: Commit cc635233a614f62845f6361afbf9edb102bf3a04 in lucene-solr's branch refs/heads/SOLR-13105-visual from Joel Bernstein [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=cc63523 ] SOLR-13105: Update dsp docs 2 > A visual guide to Solr Math Expressions and Streaming Expressions > - > > Key: SOLR-13105 > URL: https://issues.apache.org/jira/browse/SOLR-13105 > Project: Solr > Issue Type: New Feature >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Major > Attachments: Screen Shot 2019-01-14 at 10.56.32 AM.png, Screen Shot > 2019-02-21 at 2.14.43 PM.png, Screen Shot 2019-03-03 at 2.28.35 PM.png, > Screen Shot 2019-03-04 at 7.47.57 PM.png, Screen Shot 2019-03-13 at 10.47.47 > AM.png, Screen Shot 2019-03-30 at 6.17.04 PM.png > > > Visualization is now a fundamental element of Solr Streaming Expressions and > Math Expressions. This ticket will create a visual guide to Solr Math > Expressions and Solr Streaming Expressions that includes *Apache Zeppelin* > visualization examples. > It will also cover using the JDBC expression to *analyze* and *visualize* > results from any JDBC compliant data source. > Intro from the guide: > {code:java} > Streaming Expressions exposes the capabilities of Solr Cloud as composable > functions. These functions provide a system for searching, transforming, > analyzing and visualizing data stored in Solr Cloud collections. > At a high level there are four main capabilities that will be explored in the > documentation: > * Searching, sampling and aggregating results from Solr. > * Transforming result sets after they are retrieved from Solr. > * Analyzing and modeling result sets using probability and statistics and > machine learning libraries. 
> * Visualizing result sets, aggregations and statistical models of the data. > {code} > > A few sample visualizations are attached to the ticket. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13105) A visual guide to Solr Math Expressions and Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-13105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938872#comment-16938872 ] ASF subversion and git services commented on SOLR-13105: Commit e1feb24c5e4a189b7c1cbbc2e2ee0523891dbe6f in lucene-solr's branch refs/heads/SOLR-13105-visual from Joel Bernstein [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e1feb24 ] SOLR-13105: Update dsp docs > A visual guide to Solr Math Expressions and Streaming Expressions > - > > Key: SOLR-13105 > URL: https://issues.apache.org/jira/browse/SOLR-13105 > Project: Solr > Issue Type: New Feature >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Major > Attachments: Screen Shot 2019-01-14 at 10.56.32 AM.png, Screen Shot > 2019-02-21 at 2.14.43 PM.png, Screen Shot 2019-03-03 at 2.28.35 PM.png, > Screen Shot 2019-03-04 at 7.47.57 PM.png, Screen Shot 2019-03-13 at 10.47.47 > AM.png, Screen Shot 2019-03-30 at 6.17.04 PM.png > > > Visualization is now a fundamental element of Solr Streaming Expressions and > Math Expressions. This ticket will create a visual guide to Solr Math > Expressions and Solr Streaming Expressions that includes *Apache Zeppelin* > visualization examples. > It will also cover using the JDBC expression to *analyze* and *visualize* > results from any JDBC compliant data source. > Intro from the guide: > {code:java} > Streaming Expressions exposes the capabilities of Solr Cloud as composable > functions. These functions provide a system for searching, transforming, > analyzing and visualizing data stored in Solr Cloud collections. > At a high level there are four main capabilities that will be explored in the > documentation: > * Searching, sampling and aggregating results from Solr. > * Transforming result sets after they are retrieved from Solr. > * Analyzing and modeling result sets using probability and statistics and > machine learning libraries. > * Visualizing result sets, aggregations and statistical models of the data. 
> {code} > > A few sample visualizations are attached to the ticket. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13661) A package management system for Solr
[ https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938870#comment-16938870 ] Ishan Chattopadhyaya commented on SOLR-13661: - bq. We went through deployment lifecycles of at least 3 of my clients, went through state of the art package management systems like apt, dnf etc. and thought through all potential usecases and the future of Solr (as a lean core with all non-essential features stripped out as packages in an Apache repository), as well as went through your talk+code+presentation+document. Our main focus is ease of use and a robust plugin lifecycle management experience. Your PoC was lacking some major pieces that we've covered meticulously here: security, efficient loading (without needing to restart nodes as in your PoC), ease of deployment, not requiring packages to depend on PF4J (your PoC forced users to add PF4J as a dependency, perhaps to facilitate the loading) etc. Also, we went through some bits of the user experience of the ES plugin system. [~jpountz], [~jim.ferenczi], [~mikemccand], would love your thoughts on this, based on your experience with ES. > A package management system for Solr > > > Key: SOLR-13661 > URL: https://issues.apache.org/jira/browse/SOLR-13661 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Noble Paul >Assignee: Ishan Chattopadhyaya >Priority: Major > Labels: package > Attachments: plugin-usage.png, repos.png > > Time Spent: 20m > Remaining Estimate: 0h > > Here's the design doc: > https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-13661) A package management system for Solr
[ https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938863#comment-16938863 ] Ishan Chattopadhyaya edited comment on SOLR-13661 at 9/26/19 6:28 PM:
--
1. {quote}too many decisions seem to be made with too few eyes{quote} Noble, Ishan, Jan, Andrzej, David. We've had at least this many "eyes". How many more, or who else, are needed?
2. {quote}"package" concept seems to be designed for ONE use case only, customer's internal custom packages, with arbitrary local naming of repos and packages{quote} Not at all. Apache's contrib modules can easily be installed/deployed the same way; they can be packages themselves. At the end of the day, a "package" is just a jar and some metadata. A repository is just a named location containing packages. There is no "arbitrary local naming of repos and packages"; those aspects are left to the plugin writer (which can be Apache as well, in the case of official contrib modules). {quote}before such a feature goes mainstream, the design should also include converting some of our contrib modules to packages that we release as separate binaries in the mirrors, and enable an "apache" Repo as default.{quote} Converting a contrib module needn't be a precursor to releasing this feature. Moreover, without the Gradle work completed, I don't want to change the build system too much (only to do it again later). The system design supports us doing so, and it can be taken up later. {quote}Perhaps that would mean some name spacing or name collision resolution{quote} Namespacing the packages is something we thought of. We already have two pieces of information available in public APIs and at all system boundaries/interfaces: "repository" and "package-name". If we need to internally namespace the packages, we can do so later. However, I don't think we need to: look at dnf, apt, etc., where there is no concept of package namespaces. That means third parties should be careful not to name their packages the same as official packages.
3. Sure. {quote}We need a plan for how 3rd party plugin developers can publish their plugins on their own web site or on GitHub in a well defined way.{quote} I can add a document to that effect in our ref-guide. Initially, just the repository structure documented in the design document will be supported. GitHub support can be added subsequently.
4. {quote}Hot/Cold deploy. I don't like systems where you, as part of the install, need to spin up a server.{quote} Spinning up a cluster prior to installing the packages is not "needed". Someone can cold start a cluster with plugins pre-installed. Noble and I have both documented those steps in the design doc as a reply to your comment.
6. That's a tradeoff. I initially raised the same point, but Noble suggested that adding a new set of znode watchers per collection is an overhead. I'm +0 on your suggestion.
7. We have a disagreement here. Requiring the user to edit solrconfig.xml in order to specify which collection uses which package is bad from a simplicity standpoint. Please keep in mind that hand editing a configset is an expert feature.
8. No, we're not forcing regular users to do anything extra. Regular users should be using config APIs to register/deregister their plugins. Expert users are expert enough to just add an extra package name while they are hand editing their solrconfig.xml. I think this is a very reasonable compromise.
9. The plan is to support both: (a) jar without manifest + external manifest, (b) jar containing manifest. The first scenario should be preferred in the case of multi-jar packages.
10. Sure, I like the idea of supporting multiple jars as well as a zip containing all the jars. It can be supported.
11. Plugin initialisation commands are not complex at all; they are typically just regular config API commands to register the plugins. They are necessary for pleasant adoption. A user just needs to install a package and specify which collection to deploy it to. Simple! If we go by your idea, the user would install the package but then need to hand-edit solrconfig.xml in order to add the plugins from the package to their collection.
12. I agree. We thought a lot about how to support filenames, and all the approaches seemed to have some deficiency. I've documented that as a comment in the design document. Without an additional distributed KV datastore, this seems hard to get right. The sha256.properties file approach will not work well, since every node will have its own version of the file, and maintaining update consistency will also be hard. {quote}But I put a lot of careful thought into the POC which I feel is largely lacking here{quote} I assure you that we've put a lot of thought into how we designed this. We went through deployment lifecycles of at least 3
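The package/repository model described in point 2 above (a package is just a jar plus metadata; a repository is a named location containing packages; "repository" and "package-name" together identify a package) can be sketched roughly as follows. This is an illustrative model only: the class names, fields, and repository contents here are hypothetical, not Solr's actual API.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative model only: Repo/Pkg and their fields are hypothetical, not Solr's actual API.
public class RepoSketch {
    record Pkg(String name, String jarLocation, Map<String, String> metadata) {}
    record Repo(String name, List<Pkg> packages) {}

    // Resolve a package by a repo-qualified name ("repository" + "package-name"),
    // mirroring the two pieces of information the comment says are available everywhere.
    static Optional<Pkg> resolve(List<Repo> repos, String repoName, String pkgName) {
        return repos.stream()
                .filter(r -> r.name().equals(repoName))
                .flatMap(r -> r.packages().stream())
                .filter(p -> p.name().equals(pkgName))
                .findFirst();
    }

    public static void main(String[] args) {
        Repo apache = new Repo("apache",
                List.of(new Pkg("ltr", "https://example.org/ltr-1.0.jar", Map.of("version", "1.0"))));
        Repo local = new Repo("mylocalrepo",
                List.of(new Pkg("ltr", "file:///opt/repo/ltr-0.9.jar", Map.of("version", "0.9"))));
        // The repository qualifier disambiguates two repos that both publish an "ltr" package.
        System.out.println(resolve(List.of(apache, local), "apache", "ltr").get().jarLocation());
    }
}
```

Under this model, a name collision between an official and a local package is resolved by the repository qualifier rather than by a global namespace, which matches the dnf/apt comparison made above.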
[GitHub] [lucene-solr] diegoceccarelli commented on issue #300: SOLR-11831: Skip second grouping step if group.limit is 1 (aka Las Vegas Patch)
diegoceccarelli commented on issue #300: SOLR-11831: Skip second grouping step if group.limit is 1 (aka Las Vegas Patch) URL: https://github.com/apache/lucene-solr/pull/300#issuecomment-535627929
Hi @cpoerschke! I think I addressed your comments, please let me know if I missed anything! Overview of the changes:
- I found an explanation for the mystery of the two failing tests and fixed them.
- Added checks for `numFound == 1`.
- Improved documentation.
- Forbade `group.func` and `group.query`, and documented it.
- Fixed issues with `maxScore` and added tests. I have fixed some issues with distributed `maxScore` that are not related to this patch but that I needed to fix for the tests; I think I'm going to move them into a separate PR.
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13661) A package management system for Solr
[ https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938863#comment-16938863 ] Ishan Chattopadhyaya commented on SOLR-13661:
-
1. {quote}too many decisions seem to be made with too few eyes{quote} Noble, Ishan, Jan, Andrzej, David. We've had at least this many "eyes". How many more, or who else, are needed?
2. {quote}"package" concept seems to be designed for ONE use case only, customer's internal custom packages, with arbitrary local naming of repos and packages{quote} Not at all. Apache's contrib modules can easily be installed/deployed the same way; they can be packages themselves. {quote}before such a feature goes mainstream, the design should also include converting some of our contrib modules to packages that we release as separate binaries in the mirrors, and enable an "apache" Repo as default.{quote} Converting a contrib module needn't be a precursor to releasing this feature. Moreover, without the Gradle work completed, I don't want to change the build system too much (only to do it again later). {quote}Perhaps that would mean some name spacing or name collision resolution{quote} Namespacing the packages is something we thought of. We already have two pieces of information available in public APIs and at all system boundaries/interfaces: "repository" and "package-name". If we need to internally namespace the packages, we can do so later. However, I don't think we need to: look at dnf, apt, etc., where there is no concept of package namespaces. That means third parties should be careful not to name their packages the same as official packages.
3. Sure. {quote}We need a plan for how 3rd party plugin developers can publish their plugins on their own web site or on GitHub in a well defined way.{quote} I can add a document to that effect in our ref-guide. Initially, just the repository structure documented in the design document will be supported. GitHub support can be added subsequently.
4. {quote}Hot/Cold deploy. I don't like systems where you, as part of the install, need to spin up a server.{quote} Spinning up a cluster prior to installing the packages is not "needed". Someone can cold start a cluster with plugins pre-installed. Noble and I have both documented those steps in the design doc as a reply to your comment.
6. That's a tradeoff. I initially raised the same point, but Noble suggested that adding a new set of znode watchers per collection is an overhead. I'm +0 on your suggestion.
7. We have a disagreement here. Requiring the user to edit solrconfig.xml in order to specify which collection uses which package is bad from a simplicity standpoint. Please keep in mind that hand editing a configset is an expert feature.
8. No, we're not forcing regular users to do anything extra. Regular users should be using config APIs to register/deregister their plugins. Expert users are expert enough to just add an extra package name while they are hand editing their solrconfig.xml. I think this is a very reasonable compromise.
9. The plan is to support both: (a) jar without manifest + external manifest, (b) jar containing manifest. The first scenario should be preferred in the case of multi-jar packages.
10. Sure, I like the idea of supporting multiple jars as well as a zip containing all the jars. It can be supported.
11. Plugin initialisation commands are not complex at all; they are typically just regular config API commands to register the plugins. They are necessary for pleasant adoption. A user just needs to install a package and specify which collection to deploy it to. Simple! If we go by your idea, the user would install the package but then need to hand-edit solrconfig.xml in order to add the plugins from the package to their collection.
12. I agree. We thought a lot about how to support filenames, and all the approaches seemed to have some deficiency. I've documented that as a comment in the design document. Without an additional distributed KV datastore, this seems hard to get right. The sha256.properties file approach will not work well, since every node will have its own version of the file, and maintaining update consistency will also be hard. {quote}But I put a lot of careful thought into the POC which I feel is largely lacking here{quote} I assure you that we've put a lot of thought into how we designed this. Your PoC was lacking some major pieces that we've covered meticulously here: security, efficient loading (without needing to restart nodes as in your PoC), ease of deployment, not requiring packages to depend on PF4J (your PoC forced users to add PF4J as a dependency, perhaps to facilitate the loading), etc. Just because I haven't documented each and every user story that you
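Point 12 above turns on verifying package content by hash rather than trusting per-node metadata files that can drift out of sync. A minimal stdlib-only sketch of that idea (hypothetical, not Solr's implementation): the digest is published once alongside the package, and every node recomputes the hash of the bytes it actually downloaded.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch only (not Solr's implementation): verify a package payload against a published digest.
public class Sha256Sketch {
    static String sha256Hex(byte[] payload) throws Exception {
        return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(payload));
    }

    public static void main(String[] args) throws Exception {
        byte[] jarBytes = "pretend-jar-contents".getBytes(StandardCharsets.UTF_8);
        // The digest would be published once, alongside the package, rather than
        // maintained in a per-node sha256.properties file that can drift out of sync.
        String published = sha256Hex(jarBytes);
        // Each node recomputes the hash of the bytes it downloaded and compares.
        System.out.println(published.equals(sha256Hex(jarBytes)) ? "verified" : "mismatch");
    }
}
```

Because the digest is a pure function of the content, no cross-node coordination is needed to keep the check consistent, which is the property the per-node properties file lacks.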
[GitHub] [lucene-solr] diegoceccarelli commented on a change in pull request #300: SOLR-11831: Skip second grouping step if group.limit is 1 (aka Las Vegas Patch)
diegoceccarelli commented on a change in pull request #300: SOLR-11831: Skip second grouping step if group.limit is 1 (aka Las Vegas Patch) URL: https://github.com/apache/lucene-solr/pull/300#discussion_r328753104
## File path: lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java ##
@@ -139,10 +139,18 @@ public ScoreMode scoreMode() {
   // System.out.println(" group=" + (group.groupValue == null ? "null" : group.groupValue.toString()));
   SearchGroup searchGroup = new SearchGroup<>();
   searchGroup.groupValue = group.groupValue;
+  // We pass this around so that we can get the corresponding solr id when serializing the search group to send to the federator
+  searchGroup.topDocLuceneId = group.topDoc;
   searchGroup.sortValues = new Object[sortFieldCount];
   for(int sortFieldIDX=0;sortFieldIDX<sortFieldCount;sortFieldIDX++) {
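The diff above threads the top document's internal Lucene doc id through the search group so that, at serialization time, the federator can resolve the corresponding Solr id. A standalone sketch of that pattern follows; only the field name topDocLuceneId comes from the diff, and everything else (the class, the id mapping) is a hypothetical stand-in, not Lucene's actual SearchGroup.

```java
import java.util.Map;

// Stand-in sketch of the pattern in the diff above: not Lucene's SearchGroup.
// Only the field name topDocLuceneId is taken from the diff; the rest is hypothetical.
public class GroupIdSketch {
    static class SearchGroupSketch {
        Object groupValue;
        int topDocLuceneId; // carried along so the serializer can later find the Solr id
        Object[] sortValues;
    }

    public static void main(String[] args) {
        // Pretend mapping from internal Lucene doc ids to stored Solr unique keys.
        Map<Integer, String> luceneToSolrId = Map.of(42, "doc42-uniqueKey");

        SearchGroupSketch g = new SearchGroupSketch();
        g.groupValue = "groupA";
        g.topDocLuceneId = 42; // set while collecting, as the diff does with group.topDoc

        // At serialization time, the federator resolves the Solr id from the carried doc id.
        System.out.println(luceneToSolrId.get(g.topDocLuceneId));
    }
}
```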
[jira] [Updated] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
[ https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated LUCENE-8991: --- Status: Patch Available (was: Open)
> disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
> Key: LUCENE-8991
> URL: https://issues.apache.org/jira/browse/LUCENE-8991
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Priority: Major
> Attachments: LUCENE-8991.patch
>
> An incredibly common class of jenkins failure (at least in Solr tests) stems from triggering assertion failures in java.util.HashMap -- evidently triggering bug JDK-8205399, first introduced in java-10 and fixed in java-12, but never backported to any java-10 or java-11 bug fix release...
> https://bugs.openjdk.java.net/browse/JDK-8205399
> SOLR-13653 tracks how this bug can affect Solr users, but I think it would make sense to disable java.util.HashMap assertions in our build system to reduce the confusing failures when users/jenkins run tests, since there is nothing we can do to work around this when testing with java-11 (or java-10 on branch_8x)
--
This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
[ https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated LUCENE-8991: --- Labels: Java10 Java11 (was: )
Description: An incredibly common class of jenkins failure (at least in Solr tests) stems from triggering assertion failures in java.util.HashMap -- evidently triggering bug JDK-8205399, first introduced in java-10 and fixed in java-12, but never backported to any java-10 or java-11 bug fix release... https://bugs.openjdk.java.net/browse/JDK-8205399 SOLR-13653 tracks how this bug can affect Solr users, but I think it would make sense to disable java.util.HashMap assertions in our build system to reduce the confusing failures when users/jenkins run tests, since there is nothing we can do to work around this when testing with java-11 (or java-10 on branch_8x)
> disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
> Key: LUCENE-8991
> URL: https://issues.apache.org/jira/browse/LUCENE-8991
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Priority: Major
> Labels: Java10, Java11
> Attachments: LUCENE-8991.patch
[jira] [Updated] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
[ https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated LUCENE-8991: --- Attachment: LUCENE-8991.patch Status: Open (was: Open)
FYI: I tried asking Rory about the open-jdk backporting process and why the fix for JDK-8205399 had never been backported for inclusion in 11.0.2 or 11.0.3 (or at this point 11.0.4), given how long the issue has been known relative to when those releases came out, but never got a response... http://mail-archives.apache.org/mod_mbox/lucene-dev/201907.mbox/%3calpine.DEB.2.11.1907251029100.10893@tray%3e
An example of how this type of bug can manifest in our tests, from a recent jenkins failure...
{noformat}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 13.4s J1 | TestCloudJSONFacetSKG.testRandom <<<
[junit4] > Throwable #1: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://127.0.0.1:43673/solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection: Expected mime type application/octet-stream but got text/html.
[junit4] > Error 500 Server Error
[junit4] > HTTP ERROR 500
[junit4] > Problem accessing /solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection/select. Reason: Server Error
[junit4] > Caused by: java.lang.AssertionError
[junit4] > at java.base/java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1896)
[junit4] > at java.base/java.util.HashMap$TreeNode.putTreeVal(HashMap.java:2061)
[junit4] > at java.base/java.util.HashMap.putVal(HashMap.java:633)
[junit4] > at java.base/java.util.HashMap.put(HashMap.java:607)
[junit4] > at org.apache.solr.search.LRUCache.putCacheValue(LRUCache.java:295)
[junit4] > at org.apache.solr.search.LRUCache.put(LRUCache.java:268)
[junit4] > at org.apache.solr.search.SolrCacheHolder.put(SolrCacheHolder.java:92)
{noformat}
TestCloudJSONFacetSKG seems to trigger this assertion bug a lot, but I've also seen it pop up in other tests. I haven't really dug into the details of JDK-8205399, but I suspect the size/balancing/rebalancing of the HashMap is what tickles the affected code path, so (I guess) tests that result in largish HashMaps seem more likely to trigger it?
The attached patch is a simple change to lucene/common-build.xml to modify our existing use of {{-ea -esa}} into {{-ea -esa -da:java.util.HashMap}}. Any objections?
> disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
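The patch described in this thread amounts to appending a targeted per-class disable to the JVM's assertion flags: `-ea -esa` enables all user and system assertions, and `-da:java.util.HashMap` then switches them back off for that single class. A hypothetical Ant fragment illustrating the flag combination (the actual element and property names used in lucene/common-build.xml may differ; this only shows how the flags compose):

```xml
<!-- Hypothetical sketch only: enable all assertions, then carve out java.util.HashMap -->
<junit fork="yes">
  <jvmarg line="-ea -esa -da:java.util.HashMap"/>
</junit>
```

The `-da:<classname>` form targets exactly one class, so all other assertions (including the rest of java.util) stay enabled during test runs.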
[jira] [Created] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
Chris M. Hostetter created LUCENE-8991: --
Summary: disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399
Key: LUCENE-8991
URL: https://issues.apache.org/jira/browse/LUCENE-8991
Project: Lucene - Core
Issue Type: Bug
Reporter: Chris M. Hostetter
An incredibly common class of jenkins failure (at least in Solr tests) stems from triggering assertion failures in java.util.HashMap -- evidently triggering bug JDK-8205399, first introduced in java-10 and fixed in java-12, but never backported to any java-10 or java-11 bug fix release... https://bugs.openjdk.java.net/browse/JDK-8205399 SOLR-13653 tracks how this bug can affect Solr users, but I think it would make sense to disable java.util.HashMap assertions in our build system to reduce the confusing failures when users/jenkins run tests, since there is nothing we can do to work around this when testing with java-11 (or java-10 on branch_8x)
[jira] [Commented] (SOLR-13661) A package management system for Solr
[ https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938830#comment-16938830 ] Jan Høydahl commented on SOLR-13661: I have just looked at some of the code and will not have time for a more thorough review until week after next. Here is a list of my main concerns so far: # My main concern is that too many decisions seem to be made with too few eyes, combined with a goal of merging very soon. # One example of "too few eyes" is that the "package" concept seems to be designed for ONE use case only, customer's internal custom packages, with arbitrary local naming of repos and packages. I think before such a feature goes mainstream, the design should also include converting some of our contrib modules to packages that we release as separate binaries in the mirrors, and enable an "apache" Repo as default. That requires some more thought behind stable name-spacing, so that e.g. “bin/solr install ltr” will mean the same for all customers. Perhaps that would mean some name spacing or name collision resolution, so if you have a custom local repo with a package also called "ltr", then you get an error which can be resolved by qualifying the package name like e.g. "apache:ltr" or "mylocalrepo:ltr". # We need a plan for how 3rd party plugin developers can publish their plugins on their own web site or on GitHub in a well defined way. The use of pf4j-update lib takes care of much of this, and this is also something that can be added incrementally, but the design needs to allow for this. My POC has a RepositoryFactory class that parses the repo URL (e.g. "bin/solr plugin repo add myrepo [https://host.com/repo/name];) and selects the GitHubUpdateRepository if it is a GitHub URL, the ApacheMirrorsUpdateRepository if it is an apache.org address, and the default site/FS repo else. Each of these handle the download process and signature verification in a different way. # Hot/Cold deploy. 
I don't like systems where, as part of the install, you need to spin up a server. We already have this with setting urlScheme in ZK for HTTPS. But ideally it should be possible to do a Solr install, including plugins, before you need to spin up Solr. Elasticsearch uses such a static plugin installer (but also doesn't support hot install). Having a "staging" folder where you can drop package ZIP files (or JARs), from which the node can self-install packages during first startup, could be one way to handle this.
# Robustness during upgrades is another concern. I don't see mentioned in the design doc what happens during a Solr upgrade. We should think through the scenario for both minor and major version upgrades of Solr, and by that I mean rolling upgrades. Having ZK as the only master for which version of a plugin should be used is probably not sufficient, as during a rolling Solr upgrade you could have one node on 8.3 and another node on, say, 9.0. And you could have packageA:1.0 installed while Solr 9 requires v2.0 due to removal of some APIs or what not. In the cold scenario (as in the POC) you'd shut down a Solr node, upgrade Solr, then run "bin/solr plugin upgrade outdated" before starting that node again, and that would make sure it has the correct plugin version. Since you cannot upgrade Solr while it is running, perhaps we need to hook in some validation on node startup that it does not have any packages that won't work with that Solr version, and refuse to start. And some way to have two versions of a package installed at the same time, so that instead of using the latest, the Solr node will select the newest version that is compatible. Then when that node is upgraded it will select the new version of the plugin automatically based on Version.java.
# The package system deserves its own znode in ZooKeeper instead of abusing clusterProps.
# I don't like the concept of an admin needing to "deploy" a package to a collection using a command.
Rather, collections should require a set of packages (optionally with a min version) and fail to start if one is not available in the system. If the package is available in the system, the collection should gain access to the package(s) it requires without running a deploy command.
# Simplicity should be in the front seat. Don't force users to add {{package="my-pkg"}} wherever they today can say {{class="com.example.MyPlugin"}}. This is what we have ResourceLoader and class loaders for: if we cannot find {{com.example.MyPlugin}} in the main class loader, then hunt through every package class loader until there is a match; if there is no match, throw ClassNotFound. (I never liked the {{runtimeLib=true}} equivalent in the old blob store.)
# The package design says that a manifest is not required for a package and that any plain jar can function as a package just by registering it manually. That is ok as an alternative workflow, but most packages (and all
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938818#comment-16938818 ] Mark Miller commented on SOLR-13796: Some other items:
* Moving everything off the old slow base SolrCloud tests.
* Removing old tests that are either silly now and/or of little to zero value for their cost.
* General clean-up and sensible cobweb eating.
> Fix Solr Test Performance
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Mark Miller
> Assignee: Mark Miller
> Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that almost all of our tests are capable of being very fast and logging 10x less as a result. When they get this fast, a lot of infrequent random fails become frequent and things become much easier to debug. I had fixed a lot of issues to make tests pretty damn fast in the starburst branch, but tons of tests were still ignored due to the scope of the changes going on.
> A variety of things have converged that have allowed me to absorb most of that work and build upon it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that have our tests often take dozens of seconds to minutes when they should take mere seconds or ten.
> As part of this issue, I would like to move the focus of non-Nightly tests towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into Nightly tests. Not something that happens now, but if we have solid, fast, consistent non-Nightly tests, that should open up some room for Nightly to get a status boost.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document
[ https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938764#comment-16938764 ] Ignacio Vera commented on LUCENE-8990: -- Thanks Atri!
> IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document
> Key: LUCENE-8990
> URL: https://issues.apache.org/jira/browse/LUCENE-8990
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Ignacio Vera
> Priority: Major
>
> The heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range queries. The leadCost that is provided is based on the number of documents, while the cost() of a range query is based on the number of points that potentially match the query.
> Therefore it might happen that a BKD tree has millions of points but these points correspond to just a few documents. We can then take the decision of executing the query using docValues when in fact we end up almost scanning all the points.
> Maybe the cost() function for range queries needs to take into account the average number of points per document in the tree and adjust the value accordingly.
[jira] [Comment Edited] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document
[ https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938748#comment-16938748 ] Ignacio Vera edited comment on LUCENE-8990 at 9/26/19 4:00 PM: --- I was thinking more something like:
{code:java}
double pointsPerDoc = values.size() / values.getDocCount();
values.estimatePointCount(visitor) / pointsPerDoc;{code}
Maybe that can be abstracted out as a new method in PointValues like {{estimateDocCount()}}.
was (Author: ivera): I was thinking more something like:
{code:java}
double pointsPerDoc = values.size() / values.getDocCount();
values.estimatePointCount(visitor) / pointsPerDoc;{code}
Maybe that can be abstracted out as a new method in the IntersectVisitor like {{estimateDocCount()}}.
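The arithmetic behind the proposed {{estimateDocCount()}} can be made concrete. One detail worth noting: in Lucene, PointValues.size() returns a long and getDocCount() an int, so a cast is needed to keep the per-doc average fractional. A minimal, illustrative sketch (the class, method, and all numbers are made up for illustration; this is not Lucene code, and estimateDocCount is the proposed, not an existing, method):

```java
// Hedged sketch of the proposed heuristic: divide the estimated matching
// point count by the tree's average number of points per document.
public class DocCountEstimate {
    // totalPoints plays the role of PointValues.size(),
    // docCount the role of PointValues.getDocCount().
    static long estimateDocCount(long estimatedPointCount, long totalPoints, int docCount) {
        // Cast before dividing so the average isn't truncated to an integer.
        double pointsPerDoc = (double) totalPoints / docCount;
        return (long) (estimatedPointCount / pointsPerDoc);
    }

    public static void main(String[] args) {
        // 1M points over 1k docs = 1000 points/doc,
        // so 500k estimated matching points correspond to roughly 500 docs.
        System.out.println(estimateDocCount(500_000L, 1_000_000L, 1_000));
    }
}
```

With this adjustment the range query's cost is expressed in the same unit (documents) as the leadCost it is compared against.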
[jira] [Commented] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document
[ https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938750#comment-16938750 ] Atri Sharma commented on LUCENE-8990: - +1 I am happy to take a crack at this if you are not planning to do so.
[jira] [Commented] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document
[ https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938748#comment-16938748 ] Ignacio Vera commented on LUCENE-8990: -- I was thinking more something like:
{code:java}
double pointsPerDoc = values.size() / values.getDocCount();
values.estimatePointCount(visitor) / pointsPerDoc;{code}
Maybe that can be abstracted out as a new method in the IntersectVisitor like {{estimateDocCount()}}.
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938742#comment-16938742 ] Mark Miller commented on SOLR-13796: I would also like to start enforcing limits on non-Nightly tests: runtimes that exceed a certain fairly low threshold will start failing tests and suggesting alternatives or Nightly. Close times of critical components will also be instrumented, tracked, and limited to reasonable times. There is also a new, consistent, fast way to close objects safely that does this close tracking and has also sped up object lifecycle quite a bit, even where we already tried to make things fast.
[jira] [Created] (SOLR-13796) Fix Solr Test Performance
Mark Miller created SOLR-13796: -- Summary: Fix Solr Test Performance Key: SOLR-13796 URL: https://issues.apache.org/jira/browse/SOLR-13796 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Mark Miller Assignee: Mark Miller
I had kind of forgotten, but while working on Starburst I had realized that almost all of our tests are capable of being very fast and logging 10x less as a result. When they get this fast, a lot of infrequent random fails become frequent and things become much easier to debug. I had fixed a lot of issues to make tests pretty damn fast in the starburst branch, but tons of tests were still ignored due to the scope of the changes going on.
A variety of things have converged that have allowed me to absorb most of that work and build upon it while also almost finishing it.
This will be another huge PR aimed at addressing issues that have our tests often take dozens of seconds to minutes when they should take mere seconds or ten.
As part of this issue, I would like to move the focus of non-Nightly tests towards being more minimal, consistent and fast. In exchange, we must put more effort and care into Nightly tests. Not something that happens now, but if we have solid, fast, consistent non-Nightly tests, that should open up some room for Nightly to get a status boost.
[jira] [Commented] (SOLR-13722) Package Management APIs
[ https://issues.apache.org/jira/browse/SOLR-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938726#comment-16938726 ] Ishan Chattopadhyaya commented on SOLR-13722: - [~noble.paul], I've updated the titles of this and SOLR-13710. Can you please update their descriptions to relate closely to the design document? This issue's description provides absolutely no clue what this issue is about.
> Package Management APIs
> Key: SOLR-13722
> URL: https://issues.apache.org/jira/browse/SOLR-13722
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Noble Paul
> Assignee: Noble Paul
> Priority: Major
> Labels: package
>
> This ticket totally eliminates the need for an external service to host the jars, so a URL will no longer be required. An external URL leads to unreliability because the service may go offline, or it can be DDoSed if/when too many requests are sent to it.
> Add a jar to the cluster as follows:
> {code:java}
> curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @myjar.jar http://localhost:8983/api/cluster/blob
> {code}
> This does the following operations:
> * Upload this jar to all the live nodes in the system
> * The name of the file is the {{sha256}} of the file/payload
> * The blob is agnostic of the content of the file/payload
> h2. How it works
> A blob that is POSTed to the {{/api/cluster/blob}} end point is persisted locally & all nodes are instructed to download it from this node or from any other available node. If a node comes up later, it can query other nodes in the system and download the blobs as required.
> h2. {{add-package}} command
> {code:java}
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>   "add-package": {
>     "name": "my-package",
>     "sha256": ""
>   }}' http://localhost:8983/api/cluster
> {code}
> The {{sha256}} is the same as the file name. It gets hold of the jar using the following steps:
> * check the local file system for the blob
> * if not available locally, query other live nodes for the blob (one by one)
> * if a node has it, it's downloaded and persisted to its local {{blob}} dir
> h2. Security
> The blob upload does not check the content of the payload and it does not verify the file. However, the {{add-package}} and {{update-package}} commands check for signatures (if enabled).
> The size of the file is limited to 5 MB, to avoid OOM. This can be changed using the system property {{runtime.lib.size}}.
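Since the blob's file name is the sha256 of the payload, a client can compute the expected name locally before issuing {{add-package}}. A self-contained sketch using only the JDK (the "hello" payload is a stand-in for real jar bytes; it is not part of the proposal above):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Computes the lowercase hex sha256 that the blob store uses as a file name.
public class BlobName {
    static String sha256Hex(byte[] payload) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(payload);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) sb.append(String.format("%02x", b)); // two hex chars per byte
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // In practice you would pass Files.readAllBytes(Path.of("myjar.jar")).
        System.out.println(sha256Hex("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The resulting hex string is what goes into the {{"sha256"}} field of the {{add-package}} payload.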
[jira] [Updated] (SOLR-13710) FSBlobStore: a new blob store
[ https://issues.apache.org/jira/browse/SOLR-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ishan Chattopadhyaya updated SOLR-13710:
Description:
* All jars for downloaded packages are stored in a dir SOLR_HOME/blobs.
* The file names will be the sha256 hash of the files.
* Before downloading a jar from a location, the local directory is checked first.
* POST a jar to http://localhost:8983/api/cluster/blob to distribute it in the cluster.
* A new API end point {{http://localhost:8983/api/node/blob}} will list the available jars, for example:
{code:json}
{
  "blob":["e1f9e23988c19619402f1040c9251556dcd6e02b9d3e3b966a129ea1be5c70fc",
          "79298d7d5c3e60d91154efe7d72f4536eac46698edfa22ab894b85492d562ed4"]
}
{code}
* The jar will be downloadable at {{http://localhost:8983/api/node/blob/}}
Design: https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?ts=5d86a8ad#heading=h.qxgax9a5br5o
(was: the same description, without the Design link)
> FSBlobStore: a new blob store
> Key: SOLR-13710
> URL: https://issues.apache.org/jira/browse/SOLR-13710
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Noble Paul
> Assignee: Noble Paul
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
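The "check the local directory first" rule above amounts to a path lookup under SOLR_HOME/blobs before any remote fetch. A hedged sketch of that lookup (paths and class name are illustrative; this is not the actual FSBlobStore implementation):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Blobs live under SOLR_HOME/blobs, named by their sha256 hash.
public class LocalBlobLookup {
    static Path localBlobPath(Path solrHome, String sha256) {
        return solrHome.resolve("blobs").resolve(sha256);
    }

    // Only if this returns false would a node ask its peers for the blob.
    static boolean isBlobLocal(Path solrHome, String sha256) {
        return Files.exists(localBlobPath(solrHome, sha256));
    }

    public static void main(String[] args) {
        System.out.println(localBlobPath(Paths.get("/var/solr"),
            "e1f9e23988c19619402f1040c9251556dcd6e02b9d3e3b966a129ea1be5c70fc"));
    }
}
```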
[jira] [Updated] (SOLR-13710) FSBlobStore: a new blob store
[ https://issues.apache.org/jira/browse/SOLR-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ishan Chattopadhyaya updated SOLR-13710: Summary: FSBlobStore: a new blob store (was: Persist package jars locally & expose them over http)
[jira] [Commented] (SOLR-13722) A cluster-wide blob upload package option & avoid remote url
[ https://issues.apache.org/jira/browse/SOLR-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938717#comment-16938717 ] Ishan Chattopadhyaya commented on SOLR-13722: - [~dsmiley], based on the description, this issue seems to be about providing APIs to load/unload/update jars from the blob store into the classpath. Corresponds to the "Package Management APIs", I think. Can you confirm, Noble? Also, Noble, I think the sub-task JIRAs should align closely with the design document sections.
[jira] [Commented] (SOLR-13722) A cluster-wide blob upload package option & avoid remote url
[ https://issues.apache.org/jira/browse/SOLR-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938711#comment-16938711 ] David Wayne Smiley commented on SOLR-13722: --- There are two sub-tasks that, based on the title alone, seem like the same thing: this one and SOLR-13710. So it's not clear where to discuss a new/second "blob store". Can you help disambiguate these for me, [~noble.paul]?
[jira] [Commented] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document
[ https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938700#comment-16938700 ] Atri Sharma commented on LUCENE-8990: - +1, I think that is a good heuristic – strangely enough, I was thinking of this limitation for a similar problem. Would it suffice if we just made PointRangeQuery also consider the BKDReader's docCount, in addition to pointCount? e.g. (cost = values.estimatePointCount() / values.estimateDocCount())?
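To see why the unit mismatch matters: IndexOrDocValuesQuery compares the range query's cost() against a lead cost that is measured in documents, so a cost() measured in points is inflated on a multi-valued field. An illustrative sketch of that flip (not Lucene's actual code; the decision shape, threshold, and all numbers are made up for illustration):

```java
// Illustrative sketch of the mismatch discussed above: a cost measured in
// points compared against a lead cost measured in documents.
public class CostMismatch {
    // Decision shape: prefer the points index only when its estimated cost
    // does not exceed the lead cost.
    static boolean useIndexQuery(long cost, long leadCost) {
        return cost <= leadCost;
    }

    public static void main(String[] args) {
        long matchingPoints = 900_000L; // points matching the range
        long pointsPerDoc = 1_000L;     // heavily multi-valued field
        long leadCost = 5_000L;         // docs the lead clause will visit

        // In point units the cost looks huge, so doc values get picked even
        // though only ~900 documents actually match:
        System.out.println(useIndexQuery(matchingPoints, leadCost));
        // In doc units (points divided by points-per-doc) the index wins:
        System.out.println(useIndexQuery(matchingPoints / pointsPerDoc, leadCost));
    }
}
```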
[jira] [Commented] (SOLR-13791) Remove BeanUtils reference from ivy-versions.properties
[ https://issues.apache.org/jira/browse/SOLR-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938694#comment-16938694 ] Andras Salamon commented on SOLR-13791: --- There were more references to beanutils; uploaded a new patch.
> Remove BeanUtils reference from ivy-versions.properties
> Key: SOLR-13791
> URL: https://issues.apache.org/jira/browse/SOLR-13791
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Andras Salamon
> Priority: Major
> Attachments: SOLR-13791-01.patch, SOLR-13791-02.patch
>
> SOLR-12617 removed Commons BeanUtils, but {{lucene/ivy-versions.properties}} still has a reference to beanutils, because SOLR-9515 added this line back. We can remove this line.
[jira] [Updated] (SOLR-13791) Remove BeanUtils reference from ivy-versions.properties
[ https://issues.apache.org/jira/browse/SOLR-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Salamon updated SOLR-13791: -- Attachment: SOLR-13791-02.patch
[GitHub] [lucene-solr] chatman commented on issue #898: SOLR-13661: A package management system for Solr
chatman commented on issue #898: SOLR-13661: A package management system for Solr URL: https://github.com/apache/lucene-solr/pull/898#issuecomment-535539988 * There are merge conflicts * The branch is SOLR-13722, but title is SOLR-13661. Is this raised from the right branch? * Seems like blob store changes are in this PR that is package manager related; shouldn't they be separated in different issues? * The commit messages seem very temporary, we should squash merge this branch (if we decide to do so). * One of the commits suggests "repossitory CRUD", whereas it shouldn't be here. Maybe it was removed in a subsequent commit, but the commit messages don't indicate there are any. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13764) Parse Interval Query from JSON API
[ https://issues.apache.org/jira/browse/SOLR-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938688#comment-16938688 ] Mikhail Khludnev commented on SOLR-13764: - Refer to https://cwiki.apache.org/confluence/display/SOLR/SOLR-13764+Discussion+-+Interval+Queries+in+JSON for the syntax proposal. > Parse Interval Query from JSON API > -- > > Key: SOLR-13764 > URL: https://issues.apache.org/jira/browse/SOLR-13764 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: query parsers >Reporter: Mikhail Khludnev >Priority: Major > > h2. Context > Lucene has an Intervals query, LUCENE-8196. Note: these are a kind of healthy man's Spans/Phrases. Note: it's not about ranges nor facets. > h2. Problem > There's no way to search by IntervalQuery via the JSON Query DSL. > h2. Suggestion > * Create a classic QParser {{ {!interval df=text_content}a_json_param}}, i.e. one can combine a few such refs in {{json.query.bool}} > * It accepts just the name of a JSON param; nothing like this happens yet. > * This param carries plain JSON which is accessible via {{req.getJSON()}}
> {
>   query: {bool: {should: [
>     {interval: i_1},
>     {interval: {query: i_2, df: title}}
>   ]}},
>   params: {
>     df: description_t,
>     i_1: {phrase: "lorem ipsum"},
>     i_2: {unordered: [{term: "bar"}, {phrase: "bag ban"}]}
>   }
> }
> h2. Challenges > * I have no idea about the particular JSON DSL for these queries; the Lucene API seems easily JSON-able. Proposals are welcome. > * Another awkward thing is combining analysis and the low-level query API, e.g. what if one requests a term for one word and analysis yields two tokens; and vice versa, requesting a phrase might end up with a single token stream. > * Putting JSON into a Jira ticket description > h2. Q: Why don't.. > .. put the intervals DSL right into {{json.query}}, avoiding these odd param refs?
> A: It requires heavy lifting for {{JsonQueryConverter}}, which is streamlined for handling good old HTTP parametrized queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thomaswoeckinger edited a comment on issue #902: SOLR-13795: Reload solr core after schema is persisted.
thomaswoeckinger edited a comment on issue #902: SOLR-13795: Reload solr core after schema is persisted. URL: https://github.com/apache/lucene-solr/pull/902#issuecomment-535515216 @dsmiley: May you have time to review? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thomaswoeckinger commented on issue #902: SOLR-13795: Reload solr core after schema is persisted.
thomaswoeckinger commented on issue #902: SOLR-13795: Reload solr core after schema is persisted. URL: https://github.com/apache/lucene-solr/pull/902#issuecomment-535515216 @dsmiley: May you have time to review! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-13795) SolrIndexSearcher still uses old schema after schema update using schema-api
[ https://issues.apache.org/jira/browse/SOLR-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wöckinger updated SOLR-13795: Labels: easyfix pull-request-available (was: easyfix) > SolrIndexSearcher still uses old schema after schema update using schema-api > > > Key: SOLR-13795 > URL: https://issues.apache.org/jira/browse/SOLR-13795 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: config-api, Schema and Analysis, Server, SolrJ, v2 API >Affects Versions: 7.7.2, master (9.0), 8.2 >Reporter: Thomas Wöckinger >Priority: Critical > Labels: easyfix, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When adding a new field to the schema using the schema-api, the new field is not > known by the current SolrIndexSearcher. In SolrCloud every core gets reloaded > after the new schema is persisted; this does not happen in the case of a standalone > HTTP Solr server or EmbeddedSolrServer. > So currently an additional commit is necessary to open a new > SolrIndexSearcher using the new schema. > The fix is really easy: just reload the core! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thomaswoeckinger opened a new pull request #902: SOLR-13795: Reload solr core after schema is persisted.
thomaswoeckinger opened a new pull request #902: SOLR-13795: Reload solr core after schema is persisted. URL: https://github.com/apache/lucene-solr/pull/902 # Description Please provide a short description of the changes you're making with this pull request. # Solution Please provide a short description of the approach taken to implement your solution. # Tests Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem. # Checklist Please review the following and check all that apply: - [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [ ] I have created a Jira issue and added the issue ID to my pull request title. - [ ] I am authorized to contribute this code to the ASF and have removed any code I do not have a license to distribute. - [ ] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [ ] I have developed this patch against the `master` branch. - [ ] I have run `ant precommit` and the appropriate test suite. - [ ] I have added tests for my changes. - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-13795) SolrIndexSearcher still uses old schema after schema update using schema-api
Thomas Wöckinger created SOLR-13795: --- Summary: SolrIndexSearcher still uses old schema after schema update using schema-api Key: SOLR-13795 URL: https://issues.apache.org/jira/browse/SOLR-13795 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: config-api, Schema and Analysis, Server, SolrJ, v2 API Affects Versions: 8.2, 7.7.2, master (9.0) Reporter: Thomas Wöckinger When adding a new field to the schema using the schema-api, the new field is not known by the current SolrIndexSearcher. In SolrCloud every core gets reloaded after the new schema is persisted; this does not happen in the case of a standalone HTTP Solr server or EmbeddedSolrServer. So currently an additional commit is necessary to open a new SolrIndexSearcher using the new schema. The fix is really easy: just reload the core! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
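The stale-searcher behaviour described above can be modelled with plain Java (a toy sketch with no Solr classes; all names below are made up for illustration): the open searcher keeps the schema snapshot it was created with, so a schema-API change stays invisible until a reload opens a new searcher.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the reported bug: a searcher snapshots the schema at
// construction time, so schema changes are invisible until the core is
// "reloaded" and a new searcher is created. Not Solr code.
public class StaleSchemaModel {
    static class Searcher {
        final Map<String, String> schemaSnapshot;
        Searcher(Map<String, String> schema) {
            this.schemaSnapshot = new HashMap<>(schema); // snapshot, not a live view
        }
        boolean knowsField(String name) { return schemaSnapshot.containsKey(name); }
    }

    static class Core {
        final Map<String, String> schema = new HashMap<>();
        Searcher searcher = new Searcher(schema);
        void addFieldViaSchemaApi(String name, String type) { schema.put(name, type); }
        void reload() { searcher = new Searcher(schema); } // the proposed fix
    }

    public static void main(String[] args) {
        Core core = new Core();
        core.addFieldViaSchemaApi("title_s", "string");
        System.out.println(core.searcher.knowsField("title_s")); // false: stale searcher
        core.reload();
        System.out.println(core.searcher.knowsField("title_s")); // true after reload
    }
}
```

In Solr itself the equivalent reload happens server-side; the issue notes SolrCloud already does this after persisting the managed schema, while standalone and embedded modes do not.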
[jira] [Closed] (SOLR-13788) Resolve multiple IPs from specified zookeeper URL
[ https://issues.apache.org/jira/browse/SOLR-13788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ween Jiann closed SOLR-13788. - > Resolve multiple IPs from specified zookeeper URL > - > > Key: SOLR-13788 > URL: https://issues.apache.org/jira/browse/SOLR-13788 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Affects Versions: 8.1.1 >Reporter: Ween Jiann >Priority: Minor > Labels: features > > Use DNS lookup to get the IPs of the servers listed in ZK_HOST or -z param. > This would help cloud deployment as DNS is often used to group services > together. > [https://lucene.apache.org/solr/guide/8_1/setting-up-an-external-zookeeper-ensemble.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (SOLR-13788) Resolve multiple IPs from specified zookeeper URL
[ https://issues.apache.org/jira/browse/SOLR-13788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ween Jiann resolved SOLR-13788. --- Resolution: Not A Problem > Resolve multiple IPs from specified zookeeper URL > - > > Key: SOLR-13788 > URL: https://issues.apache.org/jira/browse/SOLR-13788 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Affects Versions: 8.1.1 >Reporter: Ween Jiann >Priority: Minor > Labels: features > > Use DNS lookup to get the IPs of the servers listed in ZK_HOST or -z param. > This would help cloud deployment as DNS is often used to group services > together. > [https://lucene.apache.org/solr/guide/8_1/setting-up-an-external-zookeeper-ensemble.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
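What the issue asks for can be sketched as follows — a hypothetical helper (not part of Solr) that expands a single ZK_HOST entry into one entry per IP returned by DNS:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Solr code: expand "host:port" into one
// "ip:port" entry per address that DNS resolves for the host.
public class ZkHostResolver {
    static List<String> expand(String host, int port) {
        List<String> out = new ArrayList<>();
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                out.add(addr.getHostAddress() + ":" + port);
            }
        } catch (UnknownHostException e) {
            out.add(host + ":" + port); // fall back to the literal host string
        }
        return out;
    }

    public static void main(String[] args) {
        // A DNS name that fronts several ZooKeeper nodes would expand to
        // several entries; "localhost" typically expands to one.
        System.out.println(expand("localhost", 2181));
    }
}
```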
[jira] [Created] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document
Ignacio Vera created LUCENE-8990: Summary: IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document Key: LUCENE-8990 URL: https://issues.apache.org/jira/browse/LUCENE-8990 Project: Lucene - Core Issue Type: Bug Reporter: Ignacio Vera Heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range queries. The leadCost that is provided is based on the number of documents, while the cost() of a range query is based on the number of points that potentially match the query. Therefore it might happen that a BKD tree has millions of points but these points correspond to just a few documents. We can then decide to execute the query using doc values when in fact we are almost scanning all the points. Maybe the cost() function for range queries needs to take into account the average number of points per document in the tree and adjust the value accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
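A back-of-the-envelope sketch of the suggested adjustment (plain Java; `estimatedPointCount`, `totalPoints` and `docCount` are illustrative names, not actual Lucene identifiers): scale a points-based cost estimate by the average number of points per document so it becomes comparable to the document-based leadCost.

```java
// Illustrative sketch, not Lucene code: convert a points-based range-query
// cost estimate into an approximate document count, so it can be compared
// fairly against IndexOrDocValuesQuery's document-based leadCost.
public class RangeCostSketch {
    static long adjustedCost(long estimatedPointCount, long totalPoints, long docCount) {
        // e.g. 10M points over 100k docs -> 100 points per doc on average
        double pointsPerDoc = (double) totalPoints / Math.max(1, docCount);
        // a 5M-point match estimate then corresponds to roughly 50k docs
        return (long) (estimatedPointCount / pointsPerDoc);
    }

    public static void main(String[] args) {
        System.out.println(adjustedCost(5_000_000L, 10_000_000L, 100_000L)); // 50000
    }
}
```

Without such scaling, a multi-valued field makes the points-based cost look far larger than the doc-based leadCost, steering the heuristic toward doc values even when the points query would match most documents anyway.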
[jira] [Commented] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938496#comment-16938496 ] Guoqiang Jiang commented on LUCENE-8980: Hi, [~dsmiley], thanks for your suggestion. I have updated the description and comments. Please help to commit this improvement. Thanks again. > Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >Assignee: David Wayne Smiley >Priority: Major > Labels: performance > Fix For: master (9.0) > > Time Spent: 3h 50m > Remaining Estimate: 0h > > *Description* > In Elasticsearch, which is based on Lucene, each document has an indexed _id > field that uniquely identifies it. When Elasticsearch uses the _id field to > find a document, Lucene has to check all the segments of the index. When the > values of the _id field are highly sequential, the performance is optimizable. > > *Solution* > Since Lucene stores min/maxTerm metrics for each segment and field, we can > use those metrics to optimise the performance of the Lucene lookup API. When calling > SegmentTermsEnum.seekExact() to look up a term in an index, we can check > whether the term falls in the range of minTerm and maxTerm, so that we can > skip some useless segments as soon as possible. > > This improvement is beneficial to the ES read/write API and the Lucene lookup API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
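The min/maxTerm check described in the issue can be sketched with plain unsigned byte comparisons (a standalone illustration, not the actual patch; Lucene's BytesRef likewise compares bytes unsigned-lexicographically):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Standalone sketch of the segment-skipping check: if the sought term cannot
// lie in [minTerm, maxTerm] for a segment, seekExact can skip that segment
// without touching its term index at all.
public class SeekExactSketch {
    static boolean mayContain(byte[] term, byte[] minTerm, byte[] maxTerm) {
        return Arrays.compareUnsigned(term, minTerm) >= 0
            && Arrays.compareUnsigned(term, maxTerm) <= 0;
    }

    public static void main(String[] args) {
        byte[] min = "id0001".getBytes(StandardCharsets.UTF_8);
        byte[] max = "id0999".getBytes(StandardCharsets.UTF_8);
        // A sequential id inside the segment's range must still be looked up:
        System.out.println(mayContain("id0500".getBytes(StandardCharsets.UTF_8), min, max)); // true
        // An id past maxTerm lets the whole segment be skipped:
        System.out.println(mayContain("id1500".getBytes(StandardCharsets.UTF_8), min, max)); // false
    }
}
```

This is why sequential ids benefit so much: newer segments hold disjoint id ranges, so most segments fail the range test immediately.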
[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Jiang updated LUCENE-8980: --- Description: *Description* In Elasticsearch, which is based on Lucene, each document has an indexed _id field that uniquely identifies it. When Elasticsearch use the _id field to find a document from Lucene, Lucene have to check all the segments of the index. When the values of the _id field are very sequentially, the performance is optimizable. *Solution* Since Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise performance of Lucene look up API. When calling SegmentTermsEnum.seekExact() to lookup an term in an index, we can check whether the term fall in the range of minTerm and maxTerm, so that we can skip some useless segments as soon as possible. This improvement is beneficial to ES read/write API and Lucene look up API. was: *Description* In Elasticsearch, which is based on Lucene, each document has an indexed _id field that uniquely identifies it. When Elasticsearch use the _id field to find a document from Lucene, Lucene have to check all the segments of the index. When the values of the _id field are very sequentially, the performance is optimizable. *Solution* Since Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise performance of Lucene look up API. When calling SegmentTermsEnum.seekExact() to lookup an term in an index, we can check whether the term fall in the range of minTerm and maxTerm, so that we can skip some useless segments as soon as possible. This PR is beneficial to ES read/write API and Lucene look up API. 
> Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >Assignee: David Wayne Smiley >Priority: Major > Labels: performance > Fix For: master (9.0) > > Time Spent: 3h 50m > Remaining Estimate: 0h > > *Description* > In Elasticsearch, which is based on Lucene, each document has an indexed _id > field that uniquely identifies it. When Elasticsearch use the _id field to > find a document from Lucene, Lucene have to check all the segments of the > index. When the values of the _id field are very sequentially, the > performance is optimizable. > > *Solution* > Since Lucene stores min/maxTerm metrics for each segment and field, we can > use those metrics to optimise performance of Lucene look up API. When calling > SegmentTermsEnum.seekExact() to lookup an term in an index, we can check > whether the term fall in the range of minTerm and maxTerm, so that we can > skip some useless segments as soon as possible. > > This improvement is beneficial to ES read/write API and Lucene look up API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938423#comment-16938423 ] Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 10:50 AM: -- We ran another test case, _wikimedium10m_, to verify the improvement on a large data set. The complete results are [here|https://gist.github.com/jgq2008303393/44768d69a843c7b421e765bbab9360fd.js]. The following table is the result of the last run:
|TaskQPS |baseline|StdDevQPS|my_modified_version| StdDev |Pct_diff(percent_diff)|
| --- | --- | --- | --- | --- | --- |
|OrHighNotLow | 293.93 | (5.8%) | 286.46 | (6.6%) | -2.5%(-14% - 10%)|
| OrHighNotHigh | 258.18 | (3.7%) | 252.41 | (5.0%) | -2.2%(-10% - 6%)|
| OrHighLow | 206.52 | (6.2%) | 202.55 | (6.2%) | -1.9%(-13% - 11%)|
| MedPhrase | 16.41 | (4.1%) | 16.12 | (2.6%) | -1.7%( -8% - 5%)|
| LowTerm | 608.71 | (5.7%) | 599.21 | (4.4%) | -1.6%(-10% - 9%)|
| Prefix3 | 37.96 | (2.8%) | 37.51 | (3.8%) | -1.2%( -7% - 5%)|
| OrNotHighHigh | 255.49 | (5.5%) | 252.63 | (6.1%) | -1.1%(-12% - 11%)|
| MedSloppyPhrase | 13.71 | (3.5%) | 13.58 | (3.7%) | -1.0%( -7% - 6%)|
|HighSloppyPhrase | 17.00 | (3.3%) | 16.84 | (3.7%) | -0.9%( -7% - 6%)|
| OrHighHigh | 19.02 | (2.6%) | 18.85 | (2.7%) | -0.9%( -6% - 4%)|
| MedTerm | 564.56 | (4.6%) | 559.38 | (2.9%) | -0.9%( -8% - 6%)|
|OrNotHighLow | 294.29 | (4.9%) | 291.86 | (4.2%) | -0.8%( -9% - 8%)|
| AndHighLow | 303.17 | (3.7%) | 300.72 | (4.5%) | -0.8%( -8% - 7%)|
| AndHighHigh | 28.24 | (2.1%) | 28.01 | (2.7%) | -0.8%( -5% - 4%)|
|Wildcard | 64.64 | (3.9%) | 64.21 | (4.0%) | -0.7%( -8% - 7%)|
|HighSpanNear | 15.14 | (2.8%) | 15.04 | (2.5%) | -0.7%( -5% - 4%)|
|HighTerm | 431.22 | (3.9%) | 428.68 | (2.9%) | -0.6%( -7% - 6%)|
| LowSloppyPhrase | 19.29 | (2.2%) | 19.18 | (2.9%) | -0.6%( -5% - 4%)|
| LowSpanNear | 64.32 | (2.3%) | 63.99 | (2.0%) | -0.5%( -4% - 3%)|
| Fuzzy2 | 34.51 |(12.8%) | 34.34 |(11.9%) | -0.5%(-22% - 27%)|
| MedSpanNear | 51.51 | (2.3%) | 51.28 | (1.6%) | -0.4%( -4% - 3%)|
| HighTermDayOfYearSort | 51.45 | (6.6%) | 51.24 | (7.5%) | -0.4%(-13% - 14%)|
|OrHighNotMed | 306.95 | (5.1%) | 306.03 | (3.2%) | -0.3%( -8% - 8%)|
|BrowseDateTaxoFacets | 1.48 | (0.6%) |1.47 | (1.2%) | -0.2%( -1% - 1%)|
| BrowseMonthSSDVFacets | 6.15 | (1.1%) |6.14 | (3.6%) | -0.2%( -4% - 4%)|
| HighPhrase | 186.86 | (6.2%) | 186.64 | (3.7%) | -0.1%( -9% - 10%)|
| Respell | 48.69 | (4.1%) | 48.65 | (4.0%) | -0.1%( -7% - 8%)|
| AndHighMed | 65.66 | (3.0%) | 65.74 | (3.2%) | 0.1%( -5% - 6%)|
|HighIntervalsOrdered | 6.68 | (1.5%) |6.69 | (1.7%) | 0.1%( -3% - 3%)|
| LowPhrase | 219.11 | (5.7%) | 220.24 | (3.5%) | 0.5%( -8% - 10%)|
| OrHighMed | 68.05 | (4.5%) | 68.44 | (3.1%) | 0.6%( -6% - 8%)|
|OrNotHighMed | 272.89 | (5.7%) | 274.77 | (4.1%) | 0.7%( -8% - 11%)|
| IntNRQ | 37.58 |(23.8%) | 37.96 |(24.2%) | 1.0%(-37% - 64%)|
|BrowseDayOfYearSSDVFacets| 5.34 | (4.2%) |5.40 | (2.9%) | 1.2%( -5% - 8%)|
| HighTermMonthSort | 34.82 |(11.7%) | 35.81 |(14.9%) | 2.9%(-21% - 33%)|
| BrowseMonthTaxoFacets |4781.41 | (3.9%) | 4931.19 | (2.7%) | 3.1%( -3% - 10%)|
| Fuzzy1 | 35.98 | (9.7%) | 37.42 | (8.0%) | 4.0%(-12% - 23%)|
|BrowseDayOfYearTaxoFacets|4688.64 | (3.6%) | 4878.52 | (3.6%) | 4.0%( -3% - 11%)|
|PKLookup | 72.93 | (4.7%) | 95.23 | (3.3%) | 30.6%( 21% - 40%)|
was (Author: jgq2008303393): We run another test case _wikimedium10m _to verify the improvement on a large data set. The complete results are [here|https://gist.github.com/jgq2008303393/44768d69a843c7b421e765bbab9360fd.js]. The following table is the result of the last run: |TaskQPS |baseline|StdDevQPS|my_modified_version|
[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938416#comment-16938416 ] Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 10:50 AM: -- We have done more performance tests using the _luceneutil_ tool, and the complete test results are [here|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js]. The _luceneutil_ tool executes the _wikimedium10k_ task repeatedly, 20 times. The following table is the result of the last run. As shown in the table below, most of the indicators are basically stable, while the _PKLookup_ indicator has a performance improvement of 58.7%.
|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)|
|MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)|
|LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)|
|AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)|
|BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)|
|HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)|
|Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)|
|LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)|
|Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)|
|AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)|
|OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)|
|AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)|
|LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)|
|IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)|
|HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)|
|Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)|
|OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)|
|BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)|
|Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)|
|MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)|
|MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 22%)|
|HighSloppyPhrase|320.36|(10.0%)|323.46|(14.6%)|1.0%(-21% - 28%)|
|BrowseDateTaxoFacets|2065.89|(13.7%)|2088.22|(13.2%)|1.1%(-22% - 32%)|
|Respell|187.05|(12.2%)|189.48|(10.1%)|1.3%(-18% - 26%)|
|MedSloppyPhrase|583.45|(11.3%)|592.32|(9.9%)|1.5%(-17% - 25%)|
|HighTerm|1114.87|(12.0%)|1131.89|(12.8%)|1.5%(-20% - 29%)|
|HighTermDayOfYearSort|408.17|(13.1%)|416.13|(9.3%)|1.9%(-18% - 27%)|
|BrowseDayOfYearTaxoFacets|5460.05|(8.5%)|5591.96|(8.0%)|2.4%(-13% - 20%)|
|BrowseMonthTaxoFacets|5490.18|(8.0%)|5654.03|(9.3%)|3.0%(-13% - 22%)|
|LowSloppyPhrase|562.96|(10.1%)|583.91|(9.5%)|3.7%(-14% - 25%)|
|HighPhrase|221.20|(11.9%)|229.85|(12.2%)|3.9%(-17% - 31%)|
|OrHighMed|352.09|(12.3%)|369.39|(9.4%)|4.9%(-14% - 30%)|
|PKLookup|85.19|(18.1%)|135.38|(22.7%)|58.9%( 15% - 121%)|
was (Author: jgq2008303393): We have done more performance test using _luceneutil_ tool. And the complete test results are [here|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js]. The _lueneutil_ tool repeatedly execute the _wikimedium10k_ 20 times. The following table is the result of the last run. As shown in the table below, most of the indicators are basically stable, while the _PKLookup_ indicator has a performance improvement of 58.7%. The _Get_ and _Bulk_ API of Elasticsearch will also take benefit of this enhancement.
|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)|
|MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)|
|LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)|
|AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)|
|BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)|
|HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)|
|Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)|
|LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)|
|Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)|
|AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)|
|OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)|
|AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)|
|LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)|
|IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)|
|HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)|
|Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)|
|OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)|
|BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)|
|Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)|
|MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)|
|MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 22%)|
|HighSloppyPhrase|320.36|(10.0%)|323.46|(14.6%)|1.0%(-21% - 28%)|
|BrowseDateTaxoFacets|2065.89|(13.7%)|2088.22|(13.2%)|1.1%(-22% - 32%)|
[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938415#comment-16938415 ] Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 10:49 AM: -- *Tests* We have made some write benchmarks with _id values in UUID V1 format, and the write performance of Elasticsearch is as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8 (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5 (-7.7%)|+25.7%|
As shown above, after 8 hours of continuous writing, write speed improves by 18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. The search API of Elasticsearch will also benefit from this improvement. It should be noted that the benchmark needs to run continuously for several hours, because the performance improvement is not obvious when the data is completely cached or the number of segments is too small.
was (Author: jgq2008303393): *Tests* We have made some write benchmark using _id in UUID V1 format, and the benchmark result is as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8 (-6.7%)|+22.1%|31.5w/s (18.0%)|61.5 (-7.7%)|+25.7%|
As shown above, after 8 hours of continuous writing, write speed improves by 18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. The Get and Bulk API of Elasticsearch will also take benefit of this enhancement. It should be noted that the benchmark test needs to be run several hours continuously, because the performance improvements is not obvious when the data is completely cached or the number of segments is too small.
> Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >Assignee: David Wayne Smiley >Priority: Major > Labels: performance > Fix For: master (9.0) > > Time Spent: 3h 50m > Remaining Estimate: 0h > > *Description* > In Elasticsearch, which is based on Lucene, each document has an indexed _id > field that uniquely identifies it. When Elasticsearch use the _id field to > find a document from Lucene, Lucene have to check all the segments of the > index. When the values of the _id field are very sequentially, the > performance is optimizable. > > *Solution* > Since Lucene stores min/maxTerm metrics for each segment and field, we can > use those metrics to optimise performance of Lucene look up API. When calling > SegmentTermsEnum.seekExact() to lookup an term in an index, we can check > whether the term fall in the range of minTerm and maxTerm, so that we can > skip some useless segments as soon as possible. > > This PR is beneficial to ES read/write API and Lucene look up API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Jiang updated LUCENE-8980: --- Description: *Description* In Elasticsearch, which is based on Lucene, each document has an indexed _id field that uniquely identifies it. When Elasticsearch use the _id field to find a document from Lucene, Lucene have to check all the segments of the index. When the values of the _id field are very sequentially, the performance is optimizable. *Solution* Since Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise performance of Lucene look up API. When calling SegmentTermsEnum.seekExact() to lookup an term in an index, we can check whether the term fall in the range of minTerm and maxTerm, so that we can skip some useless segments as soon as possible. This PR is beneficial to ES read/write API and Lucene look up API. was: *Description* In Elasticsearch, which is based on Lucene, each document has an indexed _id field that uniquely identifies it. When Elasticsearch use the _id field to find a document from Lucene, Lucene have to check all the segments of the index. When the values of the _id field are very sequentially, the performance is optimizable. *Solution* As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise performance of Lucene look up API. When calling SegmentTermsEnum.seekExact() to lookup an term in an index, we can check whether the term fall in the range of minTerm and maxTerm, so that we can skip some useless segments as soon as possible. This PR is beneficial to ES read/write API and Lucene look up API. 
> Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >Assignee: David Wayne Smiley >Priority: Major > Labels: performance > Fix For: master (9.0) > > Time Spent: 3h 50m > Remaining Estimate: 0h > > *Description* > In Elasticsearch, which is based on Lucene, each document has an indexed _id > field that uniquely identifies it. When Elasticsearch use the _id field to > find a document from Lucene, Lucene have to check all the segments of the > index. When the values of the _id field are very sequentially, the > performance is optimizable. > > *Solution* > Since Lucene stores min/maxTerm metrics for each segment and field, we can > use those metrics to optimise performance of Lucene look up API. When calling > SegmentTermsEnum.seekExact() to lookup an term in an index, we can check > whether the term fall in the range of minTerm and maxTerm, so that we can > skip some useless segments as soon as possible. > > This PR is beneficial to ES read/write API and Lucene look up API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Jiang updated LUCENE-8980: --- Description: *Description* In Elasticsearch, which is based on Lucene, each document has an indexed _id field that uniquely identifies it. When Elasticsearch uses the _id field to find a document, Lucene has to check all the segments of the index. When the values of the _id field are highly sequential, this lookup can be optimised. *Solution* As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise the performance of the Lucene lookup API. When calling SegmentTermsEnum.seekExact() to look up a term in an index, we can first check whether the term falls within the range of minTerm and maxTerm, so that we can skip useless segments as early as possible. This PR benefits the ES read/write APIs and the Lucene lookup API. was: *Description* In Elasticsearch, which is based on Lucene, each document has an _id field that uniquely identifies it. The _id field is indexed so that each document can be looked up from Lucene. When users write documents with sequentially _id values, Elasticsearch lookup up t from check _id uniqueness through Lucene API for each document, which result in poor write performance. *Solution* As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise performance of Lucene look up API. When calling SegmentTermsEnum.seekExact() to lookup an term in one segment, we can check whether the term fall in the range of minTerm and maxTerm, so that wo skip some useless segments as soon as possible. 
> Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >Assignee: David Wayne Smiley >Priority: Major > Labels: performance > Fix For: master (9.0) > > Time Spent: 3h 50m > Remaining Estimate: 0h > > *Description* > In Elasticsearch, which is based on Lucene, each document has an indexed _id > field that uniquely identifies it. When Elasticsearch use the _id field to > find a document from Lucene, Lucene have to check all the segments of the > index. When the values of the _id field are very sequentially, the > performance is optimizable. > > *Solution* > As Lucene stores min/maxTerm metrics for each segment and field, we can use > those metrics to optimise performance of Lucene look up API. When calling > SegmentTermsEnum.seekExact() to lookup an term in an index, we can check > whether the term fall in the range of minTerm and maxTerm, so that we can > skip some useless segments as soon as possible. > > This PR is beneficial to ES read/write API and Lucene look up API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Jiang updated LUCENE-8980: --- Description: *Description* In Elasticsearch, which is based on Lucene, each document has an _id field that uniquely identifies it. The _id field is indexed so that each document can be looked up from Lucene. When users write documents with sequential _id values, Elasticsearch has to check _id uniqueness through the Lucene API for each document, which results in poor write performance. *Solution* As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise the performance of the Lucene lookup API. When calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can check whether the term falls within the range of minTerm and maxTerm, so that we can skip useless segments as early as possible. was: *Description* In Elasticsearch, which is based on Lucene, each document has an _id field that uniquely identifies it. The _id field is indexed so that each document can be looked up from Lucene. When users write data with sequentially _id values, Elasticsearch has to check _id uniqueness through Lucene API for each document, which result in poor write performance. *Solution* As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise performance of Lucene look up API. When calling SegmentTermsEnum.seekExact() to lookup an term in one segment, we can check whether the term fall in the range of minTerm and maxTerm, so that wo skip some useless segments as soon as possible. 
> Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >Assignee: David Wayne Smiley >Priority: Major > Labels: performance > Fix For: master (9.0) > > Time Spent: 3h 50m > Remaining Estimate: 0h > > *Description* > In Elasticsearch, which is based on Lucene, each document has an _id field > that uniquely identifies it. The _id field is indexed so that each document > can be looked up from Lucene. When users write documents with sequentially > _id values, Elasticsearch lookup up t from check _id uniqueness through > Lucene API for each document, which result in poor write performance. > > *Solution* > As Lucene stores min/maxTerm metrics for each segment and field, we can use > those metrics to optimise performance of Lucene look up API. When calling > SegmentTermsEnum.seekExact() to lookup an term in one segment, we can check > whether the term fall in the range of minTerm and maxTerm, so that wo skip > some useless segments as soon as possible. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Jiang updated LUCENE-8980: --- Description: *Description* In Elasticsearch, which is based on Lucene, each document has an _id field that uniquely identifies it. The _id field is indexed so that each document can be looked up from Lucene. When users write data with sequential _id values, Elasticsearch has to check _id uniqueness through the Lucene API for each document, which results in poor write performance. *Solution* As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise the performance of the Lucene lookup API. When calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can check whether the term falls within the range of minTerm and maxTerm, so that we can skip useless segments as early as possible. was: *Description* In Elasticsearch, which is based on Lucene, each document has an _id field that uniquely identifies it, which is indexed so that documents can be looked up from Lucene. When users write data with self-generated _id values, even if the conflict rate is very low, Elasticsearch has to check _id uniqueness through Lucene API for each document, which result in poor write performance. *Solution* As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise performance of Lucene look up API. When calling SegmentTermsEnum.seekExact() to lookup an term in one segment, we can check whether the term fall in the range of minTerm and maxTerm, so that wo skip some useless segments as soon as possible. 
> Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >Assignee: David Wayne Smiley >Priority: Major > Labels: performance > Fix For: master (9.0) > > Time Spent: 3h 50m > Remaining Estimate: 0h > > *Description* > In Elasticsearch, which is based on Lucene, each document has an _id field > that uniquely identifies it. The _id field is indexed so that each document > can be looked up from Lucene. When users write data with sequentially _id > values, Elasticsearch has to check _id uniqueness through Lucene API for each > document, which result in poor write performance. > > *Solution* > As Lucene stores min/maxTerm metrics for each segment and field, we can use > those metrics to optimise performance of Lucene look up API. When calling > SegmentTermsEnum.seekExact() to lookup an term in one segment, we can check > whether the term fall in the range of minTerm and maxTerm, so that wo skip > some useless segments as soon as possible. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Jiang updated LUCENE-8980: --- Description: *Description* In Elasticsearch, which is based on Lucene, each document has an _id field that uniquely identifies it, which is indexed so that documents can be looked up from Lucene. When users write data with self-generated _id values, even if the conflict rate is very low, Elasticsearch has to check _id uniqueness through the Lucene API for each document, which results in poor write performance. *Solution* As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise the performance of the Lucene lookup API. When calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can check whether the term falls within the range of minTerm and maxTerm, so that we can skip useless segments as early as possible. was: *Description* In Elasticsearch, which is based on Lucene, each document has an _id field that uniquely identifies it, which is indexed so that documents can be looked up from Lucene. When users write Elasticsearch with self-generated _id values, even if the conflict rate is very low, Elasticsearch has to check _id uniqueness through Lucene API for each document, which result in poor write performance. *Solution* As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise performance of Lucene look up API. When calling SegmentTermsEnum.seekExact() to lookup an term in one segment, we can check whether the term fall in the range of minTerm and maxTerm, so that wo skip some useless segments as soon as possible. 
> Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >Assignee: David Wayne Smiley >Priority: Major > Labels: performance > Fix For: master (9.0) > > Time Spent: 3h 50m > Remaining Estimate: 0h > > *Description* > In Elasticsearch, which is based on Lucene, each document has an _id field > that uniquely identifies it, which is indexed so that documents can be looked > up from Lucene. When users write data with self-generated _id values, even if > the conflict rate is very low, Elasticsearch has to check _id uniqueness > through Lucene API for each document, which result in poor write performance. > > *Solution* > As Lucene stores min/maxTerm metrics for each segment and field, we can use > those metrics to optimise performance of Lucene look up API. When calling > SegmentTermsEnum.seekExact() to lookup an term in one segment, we can check > whether the term fall in the range of minTerm and maxTerm, so that wo skip > some useless segments as soon as possible. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8928) BKDWriter could make splitting decisions based on the actual range of values
[ https://issues.apache.org/jira/browse/LUCENE-8928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938429#comment-16938429 ] Ignacio Vera commented on LUCENE-8928: -- I ran some benchmarks comparing this new approach with the previous one; they show similar query performance but a much faster indexing rate:
||Approach||Index time Dev (sec)||Index time Base (sec)||Diff||Force merge Dev (sec)||Force merge Base (sec)||Diff||Index size Dev (GB)||Index size Base (GB)||Diff||Reader heap Dev (MB)||Reader heap Base (MB)||Diff||
|geo3d|163.5s|218.4s|-25%|0.0s|0.0s|0%|0.71|0.71|-0%|1.75|1.75|-0%|
|shapes|227.8s|319.6s|-29%|0.0s|0.0s|0%|1.27|1.27|0%|1.78|1.78|0%|
||Approach||Shape||M hits/sec Dev||M hits/sec Base||Diff||QPS Dev||QPS Base||Diff||Hit count Dev||Hit count Base||Diff||
|geo3d|box|55.58|57.53|-3%|56.56|58.54|-3%|221118844|221118844|0%|
|geo3d|polyRussia|0.56|0.56|-1%|0.16|0.16|-1%|3508671|3508671|0%|
|geo3d|poly 10|48.87|51.25|-5%|30.90|32.41|-5%|355855227|355855227|0%|
|geo3d|polyMedium|0.62|0.63|-1%|7.64|7.67|-1%|2693545|2693545|0%|
|geo3d|distance|68.16|69.70|-2%|40.00|40.91|-2%|383371884|383371884|0%|
|shapes|box|45.99|46.52|-1%|46.80|47.34|-1%|221118844|221118844|0%|
|shapes|polyRussia|6.64|7.01|-5%|1.89|2.00|-5%|3508846|3508846|0%|
|shapes|poly 10|33.40|34.69|-4%|21.12|21.93|-4%|355809475|355809475|0%|
|shapes|polyMedium|3.07|3.30|-7%|37.62|40.43|-7%|2693559|2693559|0%|
> BKDWriter could make splitting decisions based on the actual range of values > > > Key: LUCENE-8928 > URL: https://issues.apache.org/jira/browse/LUCENE-8928 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > Currently BKDWriter assumes that splitting on one dimension has no effect on > values in other dimensions. While this may be ok for geo points, this is > usually not true for ranges (or geo shapes, which are ranges too). 
Maybe we > could get better indexing by re-computing the range of values on each > dimension before making the choice of the split dimension? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
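The idea quoted above — recomputing each dimension's actual value range before choosing the split dimension — can be sketched as a small standalone heuristic. This is only a hedged illustration of the proposal, not BKDWriter's implementation: `SplitDimSketch` and `chooseSplitDim` are hypothetical names, and real BKD trees operate on byte-encoded values rather than doubles.

```java
// Illustrative sketch (not BKDWriter's actual code) of the proposed
// heuristic: before splitting a cell, recompute the true min/max of the
// points in that cell for every dimension and split on the dimension with
// the widest actual spread, instead of a range inherited from the parent.
public class SplitDimSketch {

    static int chooseSplitDim(double[][] points, int numDims) {
        int bestDim = 0;
        double bestSpread = -1;
        for (int d = 0; d < numDims; d++) {
            // Recompute the actual range on this dimension for these points.
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] p : points) {
                min = Math.min(min, p[d]);
                max = Math.max(max, p[d]);
            }
            double spread = max - min;
            if (spread > bestSpread) {
                bestSpread = spread;
                bestDim = d;
            }
        }
        return bestDim;
    }

    public static void main(String[] args) {
        // Ranges/shapes often correlate dimensions: dim 0 is nearly constant
        // here, so splitting on dim 1 partitions the data far better.
        double[][] pts = { {0.0, 10.0}, {1.0, 90.0}, {2.0, 50.0} };
        System.out.println(chooseSplitDim(pts, 2)); // 1
    }
}
```

For geo points the recomputed ranges rarely change the choice, but for ranges and shapes (where min/max dimensions are correlated) this can avoid splitting on a dimension whose values barely vary inside the current cell.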
[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938415#comment-16938415 ] Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 9:12 AM: - *Tests* We ran some write benchmarks using _id values in UUID v1 format; the results are as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|
As shown above, after 8 hours of continuous writing, write speed improves by 18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. The Get and Bulk APIs of Elasticsearch will also benefit from this enhancement. Note that the benchmark needs to run continuously for several hours, because the improvement is not obvious when the data is completely cached or the number of segments is too small. was (Author: jgq2008303393): *Tests* I have made some write benchmark using _id in UUID V1 format, and the benchmark result is as follows: ||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement|| |Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A| |Optimised Lucene|34.5w/s (+15.4%)|63.8 (-6.7%)|+22.1%|31.5w/s (18.0%)|61.5 (-7.7%)|+25.7%| As shown above, after 8 hours of continuous writing, write speed improves by 18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. The Get and Bulk API of Elasticsearch will also take benefit of this enhancement. It should be noted that the benchmark test needs to be run several hours continuously, because the performance improvements is not obvious when the data is completely cached or the number of segments is too small. 
> Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >Assignee: David Wayne Smiley >Priority: Major > Labels: performance > Fix For: master (9.0) > > Time Spent: 3h 50m > Remaining Estimate: 0h > > *Description* > In Elasticsearch, which is based on Lucene, each document has an _id field > that uniquely identifies it, which is indexed so that documents can be looked > up from Lucene. When users write Elasticsearch with self-generated _id > values, even if the conflict rate is very low, Elasticsearch has to check _id > uniqueness through Lucene API for each document, which result in poor write > performance. > > *Solution* > As Lucene stores min/maxTerm metrics for each segment and field, we can use > those metrics to optimise performance of Lucene look up API. When calling > SegmentTermsEnum.seekExact() to lookup an term in one segment, we can check > whether the term fall in the range of minTerm and maxTerm, so that wo skip > some useless segments as soon as possible. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938416#comment-16938416 ] Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 9:11 AM: - We have done more performance test using _luceneutil_ tool. And the complete test results are [here|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js]. The _lueneutil_ tool repeatedly execute the _wikimedium10k_ 20 times. The following table is the result of the last run. As shown in the table below, most of the indicators are basically stable, while the _PKLookup_ indicator has a performance improvement of 58.7%. The _Get_ and _Bulk_ API of Elasticsearch will also take benefit of this enhancement. |TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)| |HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)| |MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)| |LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)| |AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)| |BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)| |HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)| |Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)| |LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)| |Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)| |AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)| |OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)| |AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)| |LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)| |IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)| |HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)| |Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)| |OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)| |BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)| |Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 
28%)| |MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)| |MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 22%)| |HighSloppyPhrase|320.36|(10.0%)|323.46|(14.6%)|1.0%(-21% - 28%)| |BrowseDateTaxoFacets|2065.89|(13.7%)|2088.22|(13.2%)|1.1%(-22% - 32%)| |Respell|187.05|(12.2%)|189.48|(10.1%)|1.3%(-18% - 26%)| |MedSloppyPhrase|583.45|(11.3%)|592.32|(9.9%)|1.5%(-17% - 25%)| |HighTerm|1114.87|(12.0%)|1131.89|(12.8%)|1.5%(-20% - 29%)| |HighTermDayOfYearSort|408.17|(13.1%)|416.13|(9.3%)|1.9%(-18% - 27%)| |BrowseDayOfYearTaxoFacets|5460.05|(8.5%)|5591.96|(8.0%)|2.4%(-13% - 20%)| |BrowseMonthTaxoFacets|5490.18|(8.0%)|5654.03|(9.3%)|3.0%(-13% - 22%)| |LowSloppyPhrase|562.96|(10.1%)|583.91|(9.5%)|3.7%(-14% - 25%)| |HighPhrase|221.20|(11.9%)|229.85|(12.2%)|3.9%(-17% - 31%)| |OrHighMed|352.09|(12.3%)|369.39|(9.4%)|4.9%(-14% - 30%)| |PKLookup|85.19|(18.1%)|135.38|(22.7%)|58.9%( 15% - 121%)| was (Author: jgq2008303393): We have done more performance test using _luceneutil_ tool. And the complete test results are [here|[https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js]|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js]. The _lueneutil_ tool repeatedly execute the _wikimedium10k_ 20 times. The following table is the result of the last run. As shown in the table below, most of the indicators are basically stable, while the _PKLookup_ indicator has a performance improvement of 58.7%. The _Get_ and _Bulk_ API of Elasticsearch will also take benefit of this enhancement. 
|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)| |HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)| |MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)| |LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)| |AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)| |BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)| |HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)| |Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)| |LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)| |Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)| |AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)| |OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)| |AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)| |LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)| |IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)| |HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)| |Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)| |OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)| |BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)| |Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)| |MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)| |MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% -
[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938416#comment-16938416 ] Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 9:11 AM: - We have done more performance test using _luceneutil_ tool. And the complete test results are [here|[https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js]|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js]. The _lueneutil_ tool repeatedly execute the _wikimedium10k_ 20 times. The following table is the result of the last run. As shown in the table below, most of the indicators are basically stable, while the _PKLookup_ indicator has a performance improvement of 58.7%. The _Get_ and _Bulk_ API of Elasticsearch will also take benefit of this enhancement. |TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)| |HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)| |MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)| |LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)| |AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)| |BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)| |HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)| |Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)| |LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)| |Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)| |AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)| |OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)| |AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)| |LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)| |IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)| |HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)| |Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)| |OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)| 
|BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)| |Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)| |MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)| |MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 22%)| |HighSloppyPhrase|320.36|(10.0%)|323.46|(14.6%)|1.0%(-21% - 28%)| |BrowseDateTaxoFacets|2065.89|(13.7%)|2088.22|(13.2%)|1.1%(-22% - 32%)| |Respell|187.05|(12.2%)|189.48|(10.1%)|1.3%(-18% - 26%)| |MedSloppyPhrase|583.45|(11.3%)|592.32|(9.9%)|1.5%(-17% - 25%)| |HighTerm|1114.87|(12.0%)|1131.89|(12.8%)|1.5%(-20% - 29%)| |HighTermDayOfYearSort|408.17|(13.1%)|416.13|(9.3%)|1.9%(-18% - 27%)| |BrowseDayOfYearTaxoFacets|5460.05|(8.5%)|5591.96|(8.0%)|2.4%(-13% - 20%)| |BrowseMonthTaxoFacets|5490.18|(8.0%)|5654.03|(9.3%)|3.0%(-13% - 22%)| |LowSloppyPhrase|562.96|(10.1%)|583.91|(9.5%)|3.7%(-14% - 25%)| |HighPhrase|221.20|(11.9%)|229.85|(12.2%)|3.9%(-17% - 31%)| |OrHighMed|352.09|(12.3%)|369.39|(9.4%)|4.9%(-14% - 30%)| |PKLookup|85.19|(18.1%)|135.38|(22.7%)|58.9%( 15% - 121%)| was (Author: jgq2008303393): We have done more performance test using _luceneutil_ tool. And the complete test results are [here](https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js). The _lueneutil_ tool repeatedly execute the _wikimedium10k_ 20 times. The following table is the result of the last run. As shown in the table below, most of the indicators are basically stable, while the _PKLookup_ indicator has a performance improvement of 58.7%. The _Get_ and _Bulk_ API of Elasticsearch will also take benefit of this enhancement. 
| TaskQPS |baseline|StdDevQPS|my_modified_version| StdDev |Pct_diff(percent_diff)| |HighIntervalsOrdered | 303.36 | (12.5%) | 283.86 | (16.9%) | -6.4%(-31% - 26%) | | MedPhrase | 404.26 | (12.3%) | 382.64 | (10.5%) | -5.3%(-25% - 19%) | | LowTerm |2302.28 | (8.7%) | 2180.74 | (11.8%) | -5.3%(-23% - 16%) | | AndHighMed | 618.78 | (10.1%) | 586.61 | (11.8%) | -5.2%(-24% - 18%) | |BrowseDayOfYearSSDVFacets|1042.68 | (10.1%) | 992.82 | (10.7%) | -4.8%(-23% - 17%) | |HighSpanNear | 263.62 | (12.9%) | 256.07 | (14.9%) | -2.9%(-27% - 28%) | |Wildcard | 221.10 | (16.2%) | 215.32 | (11.9%) | -2.6%(-26% - 30%) | | LowSpanNear | 656.60 | (7.9%) | 639.77 | (11.3%) | -2.6%(-20% - 18%) | | Fuzzy1 | 135.61 | (9.1%) | 132.26 | (10.4%) | -2.5%(-20% - 18%) | | AndHighHigh | 409.88 | (10.9%) | 399.79 | (12.6%) | -2.5%(-23% - 23%) | | OrHighHigh | 318.45 | (12.9%) | 312.43 | (12.2%) | -1.9%(-23% - 26%) | | AndHighLow | 937.17 | (10.2%) | 921.71 | (11.4%) | -1.6%(-21% - 22%) | | LowPhrase |
[jira] [Commented] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938423#comment-16938423 ] Guoqiang Jiang commented on LUCENE-8980: We run another test case _wikimedium10m _to verify the improvement on a large data set. The complete results are [here|https://gist.github.com/jgq2008303393/44768d69a843c7b421e765bbab9360fd.js]. The following table is the result of the last run: |TaskQPS |baseline|StdDevQPS|my_modified_version| StdDev |Pct_diff(percent_diff)| | --- | :: | :-: | :---: | :: | :--: | |OrHighNotLow | 293.93 | (5.8%) | 286.46 | (6.6%) | -2.5%(-14% - 10%)| | OrHighNotHigh | 258.18 | (3.7%) | 252.41 | (5.0%) | -2.2%(-10% - 6%)| | OrHighLow | 206.52 | (6.2%) | 202.55 | (6.2%) | -1.9%(-13% - 11%)| | MedPhrase | 16.41 | (4.1%) | 16.12 | (2.6%) | -1.7%( -8% - 5%)| | LowTerm | 608.71 | (5.7%) | 599.21 | (4.4%) | -1.6%(-10% - 9%)| | Prefix3 | 37.96 | (2.8%) | 37.51 | (3.8%) | -1.2%( -7% - 5%)| | OrNotHighHigh | 255.49 | (5.5%) | 252.63 | (6.1%) | -1.1%(-12% - 11%)| | MedSloppyPhrase | 13.71 | (3.5%) | 13.58 | (3.7%) | -1.0%( -7% - 6%)| |HighSloppyPhrase | 17.00 | (3.3%) | 16.84 | (3.7%) | -0.9%( -7% - 6%)| | OrHighHigh | 19.02 | (2.6%) | 18.85 | (2.7%) | -0.9%( -6% - 4%)| | MedTerm | 564.56 | (4.6%) | 559.38 | (2.9%) | -0.9%( -8% - 6%)| |OrNotHighLow | 294.29 | (4.9%) | 291.86 | (4.2%) | -0.8%( -9% - 8%)| | AndHighLow | 303.17 | (3.7%) | 300.72 | (4.5%) | -0.8%( -8% - 7%)| | AndHighHigh | 28.24 | (2.1%) | 28.01 | (2.7%) | -0.8%( -5% - 4%)| |Wildcard | 64.64 | (3.9%) | 64.21 | (4.0%) | -0.7%( -8% - 7%)| |HighSpanNear | 15.14 | (2.8%) | 15.04 | (2.5%) | -0.7%( -5% - 4%)| |HighTerm | 431.22 | (3.9%) | 428.68 | (2.9%) | -0.6%( -7% - 6%)| | LowSloppyPhrase | 19.29 | (2.2%) | 19.18 | (2.9%) | -0.6%( -5% - 4%)| | LowSpanNear | 64.32 | (2.3%) | 63.99 | (2.0%) | -0.5%( -4% - 3%)| | Fuzzy2 | 34.51 |(12.8%) | 34.34 |(11.9%) | -0.5%(-22% - 27%)| | MedSpanNear | 51.51 | (2.3%) | 51.28 | (1.6%) | -0.4%( -4% - 3%)| | 
HighTermDayOfYearSort | 51.45 | (6.6%) | 51.24 | (7.5%) | -0.4%(-13% - 14%)| |OrHighNotMed | 306.95 | (5.1%) | 306.03 | (3.2%) | -0.3%( -8% - 8%)| |BrowseDateTaxoFacets | 1.48 | (0.6%) |1.47 | (1.2%) | -0.2%( -1% - 1%)| | BrowseMonthSSDVFacets | 6.15 | (1.1%) |6.14 | (3.6%) | -0.2%( -4% - 4%)| | HighPhrase | 186.86 | (6.2%) | 186.64 | (3.7%) | -0.1%( -9% - 10%)| | Respell | 48.69 | (4.1%) | 48.65 | (4.0%) | -0.1%( -7% - 8%)| | AndHighMed | 65.66 | (3.0%) | 65.74 | (3.2%) | 0.1%( -5% - 6%)| |HighIntervalsOrdered | 6.68 | (1.5%) |6.69 | (1.7%) | 0.1%( -3% - 3%)| | LowPhrase | 219.11 | (5.7%) | 220.24 | (3.5%) | 0.5%( -8% - 10%)| | OrHighMed | 68.05 | (4.5%) | 68.44 | (3.1%) | 0.6%( -6% - 8%)| |OrNotHighMed | 272.89 | (5.7%) | 274.77 | (4.1%) | 0.7%( -8% - 11%)| | IntNRQ | 37.58 |(23.8%) | 37.96 |(24.2%) | 1.0%(-37% - 64%)| |BrowseDayOfYearSSDVFacets| 5.34 | (4.2%) |5.40 | (2.9%) | 1.2%( -5% - 8%)| | HighTermMonthSort | 34.82 |(11.7%) | 35.81 |(14.9%) | 2.9%(-21% - 33%)| | BrowseMonthTaxoFacets |4781.41 | (3.9%) | 4931.19 | (2.7%) | 3.1%( -3% - 10%)| | Fuzzy1 | 35.98 | (9.7%) | 37.42 | (8.0%) | 4.0%(-12% - 23%)| |BrowseDayOfYearTaxoFacets|4688.64 | (3.6%) | 4878.52 | (3.6%) | 4.0%( -3% - 11%)| |PKLookup | 72.93 | (4.7%) | 95.23 | (3.3%) | 30.6%( 21% - 40%)| > Optimise SegmentTermsEnum.seekExact performance > --- > > Key: LUCENE-8980 > URL: https://issues.apache.org/jira/browse/LUCENE-8980 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.2 >Reporter: Guoqiang Jiang >
[jira] [Commented] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938416#comment-16938416 ] Guoqiang Jiang commented on LUCENE-8980: We have done more performance tests using the _luceneutil_ tool, and the complete results are [here](https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js). The _luceneutil_ tool executed the _wikimedium10k_ task 20 times; the following table shows the result of the last run. Most of the indicators are basically stable, while the _PKLookup_ indicator shows a performance improvement of 58.7%. The _Get_ and _Bulk_ APIs of Elasticsearch will also benefit from this enhancement.

| TaskQPS | baseline | StdDevQPS | my_modified_version | StdDev | Pct_diff(percent_diff) |
| HighIntervalsOrdered | 303.36 | (12.5%) | 283.86 | (16.9%) | -6.4%(-31% - 26%) |
| MedPhrase | 404.26 | (12.3%) | 382.64 | (10.5%) | -5.3%(-25% - 19%) |
| LowTerm | 2302.28 | (8.7%) | 2180.74 | (11.8%) | -5.3%(-23% - 16%) |
| AndHighMed | 618.78 | (10.1%) | 586.61 | (11.8%) | -5.2%(-24% - 18%) |
| BrowseDayOfYearSSDVFacets | 1042.68 | (10.1%) | 992.82 | (10.7%) | -4.8%(-23% - 17%) |
| HighSpanNear | 263.62 | (12.9%) | 256.07 | (14.9%) | -2.9%(-27% - 28%) |
| Wildcard | 221.10 | (16.2%) | 215.32 | (11.9%) | -2.6%(-26% - 30%) |
| LowSpanNear | 656.60 | (7.9%) | 639.77 | (11.3%) | -2.6%(-20% - 18%) |
| Fuzzy1 | 135.61 | (9.1%) | 132.26 | (10.4%) | -2.5%(-20% - 18%) |
| AndHighHigh | 409.88 | (10.9%) | 399.79 | (12.6%) | -2.5%(-23% - 23%) |
| OrHighHigh | 318.45 | (12.9%) | 312.43 | (12.2%) | -1.9%(-23% - 26%) |
| AndHighLow | 937.17 | (10.2%) | 921.71 | (11.4%) | -1.6%(-21% - 22%) |
| LowPhrase | 385.06 | (12.3%) | 379.83 | (10.8%) | -1.4%(-21% - 24%) |
| IntNRQ | 618.69 | (14.1%) | 610.58 | (10.6%) | -1.3%(-22% - 27%) |
| HighTermMonthSort | 1178.14 | (9.5%) | 1164.48 | (12.6%) | -1.2%(-21% - 23%) |
| Fuzzy2 | 46.95 | (16.2%) | 46.57 | (15.6%) | -0.8%(-28% - 36%) |
| OrHighLow | 633.64 | (9.6%) | 629.21 | (9.9%) | -0.7%(-18% - 20%) |
| BrowseMonthSSDVFacets | 1157.34 | (12.1%) | 1155.63 | (13.5%) | -0.1%(-23% - 29%) |
| Prefix3 | 297.40 | (12.1%) | 298.16 | (12.7%) | 0.3%(-21% - 28%) |
| MedSpanNear | 434.56 | (10.0%) | 437.02 | (11.4%) | 0.6%(-19% - 24%) |
| MedTerm | 2158.68 | (8.8%) | 2177.67 | (11.1%) | 0.9%(-17% - 22%) |
| HighSloppyPhrase | 320.36 | (10.0%) | 323.46 | (14.6%) | 1.0%(-21% - 28%) |
| BrowseDateTaxoFacets | 2065.89 | (13.7%) | 2088.22 | (13.2%) | 1.1%(-22% - 32%) |
| Respell | 187.05 | (12.2%) | 189.48 | (10.1%) | 1.3%(-18% - 26%) |
| MedSloppyPhrase | 583.45 | (11.3%) | 592.32 | (9.9%) | 1.5%(-17% - 25%) |
| HighTerm | 1114.87 | (12.0%) | 1131.89 | (12.8%) | 1.5%(-20% - 29%) |
| HighTermDayOfYearSort | 408.17 | (13.1%) | 416.13 | (9.3%) | 1.9%(-18% - 27%) |
| BrowseDayOfYearTaxoFacets | 5460.05 | (8.5%) | 5591.96 | (8.0%) | 2.4%(-13% - 20%) |
| BrowseMonthTaxoFacets | 5490.18 | (8.0%) | 5654.03 | (9.3%) | 3.0%(-13% - 22%) |
| LowSloppyPhrase | 562.96 | (10.1%) | 583.91 | (9.5%) | 3.7%(-14% - 25%) |
| HighPhrase | 221.20 | (11.9%) | 229.85 | (12.2%) | 3.9%(-17% - 31%) |
| OrHighMed | 352.09 | (12.3%) | 369.39 | (9.4%) | 4.9%(-14% - 30%) |
| PKLookup | 85.19 | (18.1%) | 135.38 | (22.7%) | 58.9%( 15% - 121%) |

> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 8.2
> Reporter: Guoqiang Jiang
> Assignee: David Wayne Smiley
> Priority: Major
> Labels: performance
> Fix For: master (9.0)
>
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an _id field that uniquely identifies it, which
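For readers reconciling the columns above: the headline Pct_diff figure is simply the relative QPS change between the two runs (the bracketed range additionally accounts for the StdDev columns). A quick sketch, not luceneutil's actual code, that reproduces the PKLookup figure from the table:

```java
// Quick sketch (not luceneutil's actual code) of how the headline Pct_diff
// column relates to the baseline and modified QPS columns.
public class PctDiff {
    static double pctDiff(double baselineQps, double modifiedQps) {
        // Relative change of the modified run against the baseline, in percent.
        return (modifiedQps - baselineQps) / baselineQps * 100.0;
    }

    public static void main(String[] args) {
        // PKLookup row: baseline 85.19 QPS, modified 135.38 QPS -> ~ +58.9%
        System.out.printf("PKLookup: %+.1f%%%n", pctDiff(85.19, 135.38));
    }
}
```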
[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance
[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Jiang updated LUCENE-8980: --- Description:

*Description*
In Elasticsearch, which is based on Lucene, each document has an _id field that uniquely identifies it, which is indexed so that documents can be looked up from Lucene. When users write to Elasticsearch with self-generated _id values, even if the conflict rate is very low, Elasticsearch has to check _id uniqueness through the Lucene API for each document, which results in poor write performance.

*Solution*
As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise the performance of the Lucene lookup API. When calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can check whether the term falls within the range of minTerm and maxTerm, so that we can skip useless segments as soon as possible.

was:

*Description*
In Elasticsearch, each document has an _id field that uniquely identifies it, which is indexed so that documents can be looked up from Lucene. When users write to Elasticsearch with self-generated _id values, even if the conflict rate is very low, Elasticsearch has to check _id uniqueness through the Lucene API for each document, which results in poor write performance.

*Solution*
1. Choose a better _id generator before writing to ES
Different _id formats have a great impact on write performance; we have verified this in a production cluster. Users can refer to the following blog and choose a better _id generator: [http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html]
2. Optimise with min/maxTerm metrics in Lucene
As Lucene stores min/maxTerm metrics for each segment and field, we can use those metrics to optimise the performance of the Lucene lookup API.
When calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can check whether the term falls within the range of minTerm and maxTerm, so that we can skip useless segments as soon as possible.

*Tests*
I ran a write benchmark using _id values in UUID V1 format, with the following results:

||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|

As shown above, after 8 hours of continuous writing, write speed improves by 18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. The Elasticsearch GET API and ids query would see similar performance improvements. It should be noted that the benchmark needs to run continuously for several hours, because the performance improvement is not obvious when the data is completely cached or the number of segments is too small.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
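The range check described in the solution can be sketched as follows. This is an illustrative sketch, not the actual Lucene patch; the class and method names are hypothetical stand-ins, but the comparison mirrors how Lucene orders terms as unsigned byte sequences:

```java
// Illustrative sketch of the proposed optimisation (not the actual Lucene
// patch): before walking a segment's term dictionary, compare the target
// term against the segment's minTerm/maxTerm and bail out early when the
// term cannot be present in that segment.
public class SeekExactSketch {

    // Lucene orders terms as unsigned byte sequences; Java's byte comparison
    // is signed, hence the & 0xff masking.
    static int compareUnsigned(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Returns false when target lies outside [minTerm, maxTerm], so the
     *  segment can be skipped without touching its terms index. */
    static boolean mayContain(byte[] target, byte[] minTerm, byte[] maxTerm) {
        return compareUnsigned(target, minTerm) >= 0
            && compareUnsigned(target, maxTerm) <= 0;
    }
}
```

On a primary-key workload, most segments fail this range check for most lookups, which is consistent with the PKLookup speedup reported earlier in the thread.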
[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938389#comment-16938389 ] Bruno Roustant commented on LUCENE-8920: Here is a proposal for the heuristic to select the encoding of an FST node. The idea is to have threshold values that vary according to a parameter; let's call it "timeSpaceBalance". timeSpaceBalance can have 4 values: MORE_COMPACT, COMPACT (default, current FST), FAST, FASTER.

- Keep the current FST encoding/behavior for COMPACT.
- Only try open-addressing encoding for the FAST or FASTER balance.
- Be very demanding for direct-addressing for COMPACT or MORE_COMPACT.

Do we need the MORE_COMPACT mode? Even if I don't see the use-case now, since it's easy and does not involve more code to have it, I would say yes.

There are 4 rules, one per possible encoding, ordered from top to bottom. The first encoding whose condition matches is selected.

n: number of labels (num sub-nodes)
depth: depth of the node in the tree

[list-encoding] if n <= L1 || (depth >= L2 && n <= L3)
[direct-addressing] if n / (max label - min label) >= D1
[try open-addressing] if depth <= O1 || n >= O2
[binary search] otherwise

And below are the threshold values for each timeSpaceBalance:

timeSpaceBalance = MORE_COMPACT (memory < x1): L1 = 6, L2 = 4, L3 = 11; D1 = 0.8; O1 = -1, O2 = infinite
timeSpaceBalance = COMPACT (memory x1): L1 = 4, L2 = 4, L3 = 9; D1 = 0.66; O1 = -1, O2 = infinite
timeSpaceBalance = FAST (memory x2 ?): L1 = 4, L2 = 4, L3 = 7; D1 = 0.5; O1 = 3, O2 = 10
timeSpaceBalance = FASTER (memory x3 ?): L1 = 3, L2 = 4, L3 = 5; D1 = 0.33; O1 = infinite, O2 = 0

Thoughts?
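The rule cascade above could fit together as in the sketch below. The class, enum, and method names are hypothetical (this is not Lucene code); the thresholds shown are the COMPACT values from the proposal, with "infinite" modeled as Integer.MAX_VALUE so open-addressing is never tried in that mode:

```java
// Hypothetical sketch of the proposed encoding-selection cascade (not
// Lucene code). Rules are tried top to bottom; the first match wins.
public class EncodingHeuristic {

    enum Encoding { LIST, DIRECT_ADDRESSING, OPEN_ADDRESSING, BINARY_SEARCH }

    // COMPACT (default) thresholds from the proposal; O2 = "infinite" means
    // open-addressing is never attempted in this mode.
    static final int L1 = 4, L2 = 4, L3 = 9;
    static final double D1 = 0.66;
    static final int O1 = -1, O2 = Integer.MAX_VALUE;

    /**
     * @param n          number of labels (num sub-nodes)
     * @param depth      depth of the node in the tree
     * @param labelRange maxLabel - minLabel
     */
    static Encoding select(int n, int depth, int labelRange) {
        if (n <= L1 || (depth >= L2 && n <= L3)) {
            return Encoding.LIST;
        }
        if ((double) n / labelRange >= D1) {
            return Encoding.DIRECT_ADDRESSING;
        }
        if (depth <= O1 || n >= O2) {
            return Encoding.OPEN_ADDRESSING;
        }
        return Encoding.BINARY_SEARCH;
    }
}
```

Switching timeSpaceBalance would then just swap in a different set of constants, e.g. FAST would use L3 = 7, D1 = 0.5, O1 = 3, O2 = 10.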
> Reduce size of FSTs due to use of direct-addressing encoding
> -
>
> Key: LUCENE-8920
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael Sokolov
> Priority: Blocker
> Fix For: 8.3
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps do a preliminary pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) which makes gaps very costly. Associating each label with a dense id and having an intermediate lookup, ie. lookup label -> id and then id -> arc offset instead of doing label -> arc directly could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #831: LUCENE-8949: Allow LeafFieldComparators to publish Feature Values
atris commented on issue #831: LUCENE-8949: Allow LeafFieldComparators to publish Feature Values URL: https://github.com/apache/lucene-solr/pull/831#issuecomment-535370985 Hi @jpountz , RE: this PR, I think it is a prerequisite for improvements like https://issues.apache.org/jira/browse/LUCENE-8988 and shared PQ based early termination. I was wondering if we could merge this PR and mark the API as experimental, along with a warning that it could be costly for some specific iterators. WDYT, please? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #899: LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution
atris commented on issue #899: LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution URL: https://github.com/apache/lucene-solr/pull/899#issuecomment-535360687 Any thoughts on this one? Seems safe enough to merge? I plan to merge it in another 12 hours from now -- unless any objections This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org