[jira] [Commented] (LUCENE-8989) IndexSearcher Should Handle Rejection of Concurrent Task

2019-09-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939127#comment-16939127
 ] 

ASF subversion and git services commented on LUCENE-8989:
-

Commit 15db6bfa88952cf0912b3c93d59c0cdc55bf9e2a in lucene-solr's branch 
refs/heads/master from Atri Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=15db6bf ]

LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution (#899)

When executing queries using Executors, we should gracefully handle the
case where the Executor rejects a task, and run the task on the caller
thread instead.
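
The pattern is small; a minimal sketch of it (hypothetical helper, not the actual Lucene change) might look like:

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.Executor;
import java.util.concurrent.FutureTask;
import java.util.concurrent.RejectedExecutionException;

// Sketch only: offer the work to the Executor, and if it throws
// RejectedExecutionException, run the task inline on the caller thread.
public class RejectionFallback {
  static <T> FutureTask<T> submitOrRunInline(Executor executor, Callable<T> work) {
    FutureTask<T> task = new FutureTask<>(work);
    try {
      executor.execute(task);
    } catch (RejectedExecutionException e) {
      task.run(); // executor saturated or shut down; degrade to the caller thread
    }
    return task;
  }
}
{code}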

> IndexSearcher Should Handle Rejection of Concurrent Task
> 
>
> Key: LUCENE-8989
> URL: https://issues.apache.org/jira/browse/LUCENE-8989
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As discussed in [https://github.com/apache/lucene-solr/pull/815], 
> IndexSearcher should handle the case when the executor rejects the execution 
> of a task (unavailability of threads?).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] atris merged pull request #899: LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution

2019-09-26 Thread GitBox
atris merged pull request #899: LUCENE-8989: Allow IndexSearcher To Handle 
Rejected Execution
URL: https://github.com/apache/lucene-solr/pull/899
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #899: LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution

2019-09-26 Thread GitBox
atris commented on issue #899: LUCENE-8989: Allow IndexSearcher To Handle 
Rejected Execution
URL: https://github.com/apache/lucene-solr/pull/899#issuecomment-535786132
 
 
   Merging now -- assuming lazy consensus


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #884: LUCENE-8980: optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread GitBox
atris commented on issue #884: LUCENE-8980: optimise SegmentTermsEnum.seekExact 
performance
URL: https://github.com/apache/lucene-solr/pull/884#issuecomment-535782672
 
 
   +1, I think this is a good change. The numbers look fine. My only concern 
was that since this is in the hot path of indexing, additional CPU cycles will 
be spent in performing the check. However, no degradation seems to be reported 
in your benchmarks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Ethan-Zhang commented on issue #884: LUCENE-8980: optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread GitBox
Ethan-Zhang commented on issue #884: LUCENE-8980: optimise 
SegmentTermsEnum.seekExact performance
URL: https://github.com/apache/lucene-solr/pull/884#issuecomment-535771947
 
 
   good work!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] yonik opened a new pull request #903: SOLR-13399: add SPLITSHARD splitByPrefix docs

2019-09-26 Thread GitBox
yonik opened a new pull request #903: SOLR-13399: add SPLITSHARD splitByPrefix 
docs
URL: https://github.com/apache/lucene-solr/pull/903
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399

2019-09-26 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-8991:
---
Attachment: LUCENE-8991.patch
Status: Patch Available  (was: Patch Available)

I'll do you one better, Erick...

In the updated patch, {{-da:java.util.HashMap}} is used if and only if this 
appears to be an openjdk-based JVM with spec version 10 or 11 (same logic as 
used in the existing {{documentation-lint.supported}} condition) *and* 
{{tests.asserts.hashmap}} is "false" (or unset) ... so people can override this 
logic with a single {{-Dtests.asserts.hashmap=true}}.

Which means, on trunk today:

1) This command fails fairly reliably for me using openjdk 11.0.4...
{noformat}
ant test  -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom 
-Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true 
-Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
{noformat}
...but with the patch applied it passes reliably for me, indicating that 
disabling the assertions on HashMap prevented the failure.

2) with the patch applied *and* the override prop added...
{noformat}
ant test  -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom 
-Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true 
-Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8 -Dtests.asserts.hashmap=true
{noformat}
...it starts to fail reliably for me again, indicating that the override works.
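
For reference, the gist of that conditional, rendered here as hypothetical Java rather than the actual build.xml logic:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical rendering of the build condition described above (the real
// patch expresses this in ant): pass -da:java.util.HashMap to the test JVM
// only on Java 10/11 (the JDK-8205399 range), unless tests.asserts.hashmap=true.
public class HashMapAssertFlag {
  static List<String> extraTestJvmArgs() {
    List<String> args = new ArrayList<>();
    String spec = System.getProperty("java.specification.version");
    boolean affected = "10".equals(spec) || "11".equals(spec);
    boolean keepAsserts = Boolean.getBoolean("tests.asserts.hashmap"); // override prop
    if (affected && !keepAsserts) {
      args.add("-da:java.util.HashMap"); // disable assertions for this one class
    }
    return args;
  }
}
{code}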

> disable java.util.HashMap assertions to avoid spurious failures due to 
> JDK-8205399
> --
>
> Key: LUCENE-8991
> URL: https://issues.apache.org/jira/browse/LUCENE-8991
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Priority: Major
>  Labels: Java10, Java11
> Attachments: LUCENE-8991.patch, LUCENE-8991.patch
>
>
> An incredibly common class of jenkins failures (at least in Solr tests) stems 
> from triggering assertion failures in java.util.HashMap -- evidently 
> triggering bug JDK-8205399, first introduced in java-10 and fixed in 
> java-12, but never backported to any java-10 or java-11 bug fix 
> release...
>https://bugs.openjdk.java.net/browse/JDK-8205399
> SOLR-13653 tracks how this bug can affect Solr users, but I think it would 
> make sense to disable java.util.HashMap assertions in our build system to 
> reduce the confusing failures when users/jenkins run tests, since there is 
> nothing we can do to work around this when testing with java-11 (or java-10 
> on branch_8x)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-09-26 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939010#comment-16939010
 ] 

Mark Miller commented on SOLR-13796:


So my custom gradle test run flags tests that don't seem to match their 
annotation, and at this point it spits out 100+ entries like the ones below.

All of these are super fast tests that could often easily blow past 30-40+ 
seconds in CI (if they were short tests to begin with).

 
{noformat}
org.apache.solr.handler.PingRequestHandlerTest 3s @Slow  Found very fast test 
annotated as slow!

org.apache.solr.core.SolrCoreTest 5s @Slow  Found very fast test annotated as 
slow!

org.apache.solr.highlight.HighlighterTest 2s @Slow  Found very fast test 
annotated as slow!

org.apache.solr.search.stats.TestExactSharedStatsCache 4s @Slow  Found very 
fast test annotated as slow!

org.apache.solr.core.TestDynamicLoadingUrl 3s @Slow  Found very fast test 
annotated as slow!

org.apache.solr.cloud.rule.RulesTest 7s @Slow  Found very fast test annotated 
as slow!


 {noformat}
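
A sketch of the kind of check that could produce this output (hypothetical; not Mark's actual gradle tooling):

{code:java}
import java.lang.annotation.Annotation;

// Hypothetical sketch of the mismatch check: flag @Slow-annotated suites
// whose wall-clock time is actually tiny.
public class SlowAnnotationCheck {
  static void report(Class<?> testClass, long wallClockSeconds) {
    boolean annotatedSlow = false;
    for (Annotation a : testClass.getAnnotations()) {
      if ("Slow".equals(a.annotationType().getSimpleName())) {
        annotatedSlow = true;
      }
    }
    if (annotatedSlow && wallClockSeconds < 10) { // threshold is a guess
      System.out.println(testClass.getName() + " " + wallClockSeconds
          + "s @Slow  Found very fast test annotated as slow!");
    }
  }
}
{code}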

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build on it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that have our tests 
> often take dozens of seconds to minutes when they should take mere seconds, 
> or ten at most.
> As part of this issue, I would like to move the focus of non-nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into nightly tests. Not 
> something that happens now, but if we have solid, fast, consistent 
> non-Nightly tests, that should open up some room for Nightly to get some 
> status boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13797) SolrResourceLoader produces inconsistent results when given bad arguments

2019-09-26 Thread Mike Drob (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Drob updated SOLR-13797:
-
Status: Patch Available  (was: Open)

> SolrResourceLoader produces inconsistent results when given bad arguments
> -
>
> Key: SOLR-13797
> URL: https://issues.apache.org/jira/browse/SOLR-13797
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.7.2, 8.2
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Attachments: SOLR-13797.v1.patch
>
>
> SolrResourceLoader will attempt to do some magic to infer what the user 
> wanted when loading TokenFilter and Tokenizer classes. However, this can end 
> up putting the wrong class in the cache such that the request succeeds the 
> first time but fails subsequent times. It should either succeed or fail 
> consistently on every call.
> This can be triggered in a variety of ways, but the simplest is perhaps 
> specifying the wrong element type in an indexing chain. Consider the field 
> type definition:
> {code:xml}
> <fieldType name="..." class="solr.TextField">
>   <analyzer>
>     <tokenizer class="..."/>
>     <filter class="solr.NGramTokenizerFactory" maxGramSize="2"/>
>   </analyzer>
> </fieldType>
> {code}
> If loaded by itself (e.g. docker container for standalone validation) then 
> the schema will pass and collection will succeed, with Solr actually figuring 
> out that it needs an {{NGramTokenFilterFactory}}. However, if this is loaded 
> on a cluster with other collections where the {{NGramTokenizerFactory}} has 
> been loaded correctly then we get {{ClassCastException}}. Or if this 
> collection is loaded first then others using the Tokenizer will fail instead.
> I'd argue that succeeding on both calls is the better approach because it 
> does what the user likely wants instead of what the user explicitly asks for, 
> and creates a nicer user experience that is marginally less pedantic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13797) SolrResourceLoader produces inconsistent results when given bad arguments

2019-09-26 Thread Mike Drob (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Drob updated SOLR-13797:
-
Attachment: SOLR-13797.v1.patch
Status: Open  (was: Open)

This is a patch that allows both calls to succeed by removing the bad value 
from the cache. If somebody has a preference for the strict approach, let me 
know.
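
For context, the lenient approach amounts to something like the following sketch (hypothetical names, not the actual SolrResourceLoader code):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the lenient fix: if a cached class turns out to be the wrong
// subtype for this request, evict the poisoned entry and re-resolve rather
// than failing every subsequent call.
public class ClassCache {
  private final Map<String, Class<?>> cache = new ConcurrentHashMap<>();

  <T> Class<? extends T> find(String cname, Class<T> expected) {
    Class<?> cached = cache.get(cname);
    if (cached != null && !expected.isAssignableFrom(cached)) {
      cache.remove(cname, cached); // loaded earlier under a different expected type
      cached = null;
    }
    if (cached == null) {
      cached = resolve(cname, expected);
      cache.put(cname, cached);
    }
    return cached.asSubclass(expected);
  }

  // Stand-in for the "magic" lookup that infers what the user wanted.
  private <T> Class<?> resolve(String cname, Class<T> expected) {
    throw new UnsupportedOperationException("lookup elided in this sketch");
  }
}
{code}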

> SolrResourceLoader produces inconsistent results when given bad arguments
> -
>
> Key: SOLR-13797
> URL: https://issues.apache.org/jira/browse/SOLR-13797
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.2, 7.7.2
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Attachments: SOLR-13797.v1.patch
>
>
> SolrResourceLoader will attempt to do some magic to infer what the user 
> wanted when loading TokenFilter and Tokenizer classes. However, this can end 
> up putting the wrong class in the cache such that the request succeeds the 
> first time but fails subsequent times. It should either succeed or fail 
> consistently on every call.
> This can be triggered in a variety of ways, but the simplest is perhaps 
> specifying the wrong element type in an indexing chain. Consider the field 
> type definition:
> {code:xml}
> <fieldType name="..." class="solr.TextField">
>   <analyzer>
>     <tokenizer class="..."/>
>     <filter class="solr.NGramTokenizerFactory" maxGramSize="2"/>
>   </analyzer>
> </fieldType>
> {code}
> If loaded by itself (e.g. docker container for standalone validation) then 
> the schema will pass and collection will succeed, with Solr actually figuring 
> out that it needs an {{NGramTokenFilterFactory}}. However, if this is loaded 
> on a cluster with other collections where the {{NGramTokenizerFactory}} has 
> been loaded correctly then we get {{ClassCastException}}. Or if this 
> collection is loaded first then others using the Tokenizer will fail instead.
> I'd argue that succeeding on both calls is the better approach because it 
> does what the user likely wants instead of what the user explicitly asks for, 
> and creates a nicer user experience that is marginally less pedantic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-13797) SolrResourceLoader produces inconsistent results when given bad arguments

2019-09-26 Thread Mike Drob (Jira)
Mike Drob created SOLR-13797:


 Summary: SolrResourceLoader produces inconsistent results when 
given bad arguments
 Key: SOLR-13797
 URL: https://issues.apache.org/jira/browse/SOLR-13797
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 8.2, 7.7.2
Reporter: Mike Drob
Assignee: Mike Drob


SolrResourceLoader will attempt to do some magic to infer what the user wanted 
when loading TokenFilter and Tokenizer classes. However, this can end up 
putting the wrong class in the cache such that the request succeeds the first 
time but fails subsequent times. It should either succeed or fail consistently 
on every call.

This can be triggered in a variety of ways, but the simplest is perhaps 
specifying the wrong element type in an indexing chain. Consider the field 
type definition:

{code:xml}
<fieldType name="..." class="solr.TextField">
  <analyzer>
    <tokenizer class="..."/>
    <filter class="solr.NGramTokenizerFactory" maxGramSize="2"/>
  </analyzer>
</fieldType>
{code}

If loaded by itself (e.g. docker container for standalone validation) then the 
schema will pass and collection will succeed, with Solr actually figuring out 
that it needs an {{NGramTokenFilterFactory}}. However, if this is loaded on a 
cluster with other collections where the {{NGramTokenizerFactory}} has been 
loaded correctly then we get {{ClassCastException}}. Or if this collection is 
loaded first then others using the Tokenizer will fail instead.

I'd argue that succeeding on both calls is the better approach because it does 
what the user likely wants instead of what the user explicitly asks for, and 
creates a nicer user experience that is marginally less pedantic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13747) 'ant test' should fail on JVMs w/known SSL bugs

2019-09-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938996#comment-16938996
 ] 

ASF subversion and git services commented on SOLR-13747:


Commit e979255ca75bc554a75daeda523bb0b60ade39f2 in lucene-solr's branch 
refs/heads/branch_8x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e979255 ]

SOLR-13747: New 
TestSSLTestConfig.testFailIfUserRunsTestsWithJVMThatHasKnownSSLBugs() to give 
people running tests more visibility if/when they use a known-buggy JVM causing 
most SSL tests to silently SKIP

(cherry picked from commit ec9780c8aad7ffbf394d4cbefa772c6ba61650d0)


> 'ant test' should fail on JVMs w/known SSL bugs
> 
>
> Key: SOLR-13747
> URL: https://issues.apache.org/jira/browse/SOLR-13747
> Project: Solr
>  Issue Type: Test
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-13747.patch
>
>
> If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has 
> known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this 
> -- so the user knows they should upgrade their JVM (rather than relying on 
> the user to notice that SSL tests were {{SKIP}}ed)
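
The shape of such a guard is roughly the following sketch (illustrative only; the buggy-version list here is made up, not the actual list the test checks):

{code:java}
import java.util.Set;

// Illustrative sketch only -- not the actual TestSSLTestConfig code: fail
// loudly on a known-buggy JVM instead of letting SSL tests silently SKIP.
public class SSLJvmGuard {
  static void failIfKnownSSLBugs() {
    Set<String> knownBuggy = Set.of("11", "11.0.1", "11.0.2"); // hypothetical list
    String version = System.getProperty("java.version");
    if (knownBuggy.contains(version)) {
      throw new AssertionError("JVM " + version + " has known SSL bugs; "
          + "upgrade your JDK rather than silently skipping SSL tests");
    }
  }
}
{code}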



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399

2019-09-26 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938994#comment-16938994
 ] 

Lucene/Solr QA commented on LUCENE-8991:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m  
0s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  0m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  0m  0s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:black}{color} | {color:black} {color} | {color:black}  0m 53s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-8991 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12981459/LUCENE-8991.patch |
| Optional Tests |  compile  javac  unit  ratsources  validatesourcepatterns  |
| uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / ec9780c8aad |
| ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/207/testReport/ |
| modules | C: lucene U: lucene |
| Console output | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/207/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> disable java.util.HashMap assertions to avoid spurious failures due to 
> JDK-8205399
> --
>
> Key: LUCENE-8991
> URL: https://issues.apache.org/jira/browse/LUCENE-8991
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Priority: Major
>  Labels: Java10, Java11
> Attachments: LUCENE-8991.patch
>
>
> An incredibly common class of jenkins failures (at least in Solr tests) stems 
> from triggering assertion failures in java.util.HashMap -- evidently 
> triggering bug JDK-8205399, first introduced in java-10 and fixed in 
> java-12, but never backported to any java-10 or java-11 bug fix 
> release...
>https://bugs.openjdk.java.net/browse/JDK-8205399
> SOLR-13653 tracks how this bug can affect Solr users, but I think it would 
> make sense to disable java.util.HashMap assertions in our build system to 
> reduce the confusing failures when users/jenkins run tests, since there is 
> nothing we can do to work around this when testing with java-11 (or java-10 
> on branch_8x)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938991#comment-16938991
 ] 

Noble Paul commented on SOLR-13661:
---

bq.Simplicity should be front seat. Don't force users to have to add 
{{package="my-pkg"}} 

 

A name is something everyone needs. A package name is an extremely important 
aspect: you cannot load or reload plugins from a package if we do not know the 
name of the package. Just a class name means nothing. Isolated classloaders 
are extremely important. Every sensible platform is built with isolation.

 

We can possibly later add a global feature called {{config-package=mypkg}}. 
This would mean that every plugin will load from {{mypkg}}. There is no reason 
why it cannot be added later. But not being able to load plugins from multiple 
packages is a strict NO: you are just ensuring that multiple packages from 
multiple plugin writers cannot coexist.

Another issue is backward incompatibility. The new class loader design is 
different and it can break current deployments.

This new design totally disallows per-core classloaders. This can be a problem 
for users who use Solr with core-level libs. So it is backward incompatible as 
well.


bq.Robustness during upgrades is another concern. I don't see mentioned in the 
design doc what happens during a Solr upgrade.

 

I'm not sure you have read the design properly. Robustness is the paramount 
feature of this design. We went out of our way to ensure that "update" is 
non-disruptive. Rolling restarts are NEVER REQUIRED. Solr behaves very badly 
during rolling restarts: we lose shards/replicas, or our overseer gets clogged 
with messages and needs manual intervention.
 


Overall feedback on your feedback, [~janhoy]:

I'm happy to receive and address feedback. We need to divide it into the 
following parts:
 # Enhancements which can be implemented with the current design. The current 
design meets the minimum viable product and does not obstruct or hamper 
implementing such an enhancement.
 # The current design is suboptimal or bad UX.
 # Comments on "inadequate eyes" etc. We are an OSS project and we don't work 
in the same org, so collaboration is always "inadequate". We should always 
work towards having 100% collaboration in every single feature that we build. 
The fact that we are discussing this here means that we can have collaboration 
as long as people are interested in it. So, please refrain from giving this 
"feedback". We should keep the comments short and sweet and make them 
"actionable".

 

I'll give examples of #1 and #2.
h3. Using a Package.java to load plugins

This falls into #1.
Is this a good suggestion? I would say it's questionable. But I won't delve 
into that here.

But can we implement it if required in the future? Yes, of course. Does it 
have to be discussed in this ticket? I would say it's out of scope.

h3. Changing the blob names to hash-filename

This falls into #2

This is a usability issue. I had thought about this while building it, and ab 
had raised the concern in the design doc. It is probably something we should 
address before committing this. If you have gone through the comments you 
would have seen it.

 
Let's keep the feedback coming. But let's not get distracted from the current 
task at hand. We don't have a working, useful package management system today, 
and a minimum viable product is what we need now. Bells and whistles can be 
added later.

We need the wheels, steering, transmission and seats of the car to be built 
first. Yes, we can add seat belts, air conditioning etc. later.

 

> A package management system for Solr
> 
>
> Key: SOLR-13661
> URL: https://issues.apache.org/jira/browse/SOLR-13661
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Ishan Chattopadhyaya
>Priority: Major
>  Labels: package
> Attachments: plugin-usage.png, repos.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Here's the design doc:
> https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938982#comment-16938982
 ] 

Jan Høydahl edited comment on SOLR-13661 at 9/26/19 9:35 PM:
-

Thanks for all the careful responses, Ishan; much of what you write makes a 
lot of sense. We disagree on other points.

You failed to address my point 5 on version dependency conflicts when (rolling) 
upgrading Solr.
{quote}We have a disagreement here. Needing the user to edit solrconfig.xml 
necessarily in order to specify which collection uses which package is bad from 
simplicity standpoint. Please keep in mind that hand editing configset is an 
expert feature. No, we're not forcing regular users to do anything extra. 
Regular users should be using config APIs to register/deregister their plugins. 
Expert users are expert enough to just add an extra package name while they are 
hand editing their solrconfig.xml. I think this is a very reasonable compromise.
{quote}
My concern is partly ease of use, but also keeping a config set coherent. The 
dependency information that a certain collection will not work without e.g. 
Kuromoji analyzer, ICU4J and BMax Query parser belongs there with the config 
set, so that we can alert the user when creating a collection if a dependency 
is not met. Thus support for package tags (or a variant of the lib tag as 
David suggested) in solrconfig makes perfect sense. Collections are sometimes 
moved/copied between clusters. You could install a new cluster fresh, restore 
from backup etc. I'm not against a bin/solr package deploy command, but that 
command should then add the package tag to the collection(s), not to some 
global config.

13. I thought of another case, namely backup/restore. I suppose that all 
package data from ZK will be backed up and restored. But will blobs be part of 
the backup? If not, how would a restore make sure that the blob store is 
re-populated? What about CDCR? Will it sync packages and blobs? What if the 
other cluster has another Solr version? :) 

I'd also vote for targeting 9.0 instead of 8.x. It would make for a great 
killer feature. To get the feature out in people's hands we could do a 
9.0.0-beta release this fall, while continuing to release 8.4, 8.5 etc. 
afterwards. Lots of people would start using the beta release, including 3rd 
party plugin developers, and we'd get tons of feedback and time to adjust and 
stabilize the APIs. The 9.0.0-beta release could also be a public beta for the 
Gradle build, which is another thing users (and integrators not the least) 
need to adjust to.


was (Author: janhoy):
Thanks for all the careful responses Ishan, much of what you write makes a lot 
of sense. We disagree on others.

You failed to address my point 5 on version dependency conflicts when (rolling) 
upgrading Solr.
{quote}We have a disagreement here. Needing the user to edit solrconfig.xml 
necessarily in order to specify which collection uses which package is bad from 
simplicity standpoint. Please keep in mind that hand editing configset is an 
expert feature. No, we're not forcing regular users to do anything extra. 
Regular users should be using config APIs to register/deregister their plugins. 
Expert users are expert enough to just add an extra package name while they are 
hand editing their solrconfig.xml. I think this is a very reasonable compromise.
{quote}
My concern is partly ease of use, but also keeping a config set coherent. The 
dependency information that a certain collection will not work without e.g. 
Kuromoji analyzer, ICU4J and BMax Query parser belongs there with the config 
set, so that we can alert user when creating collection if a dependency is not 
met. Thus support for foo tags (or a variant of lib tag as 
David suggested) in solrconfig makes perfect sense. Collections are sometimes 
moved/copied between clusters. You could install a new cluster fresh, restore 
from backup etc. I'm not against a bin/solr package deploy command, but that 
command should then add the  tag to the collection(s), not to some 
global config.

13. I thought of another case, namely backup/restore. I suppose that all 
package data from ZK will be backed up and restored. But will blobs be part of 
the backup? If not, how would a restore make sure that the blob store is 
re-populated?

I'd also vote for targeting 9.0 instead of 8.x. It would make for a great 
killer feature. To get the feature out in people's hands we could do an 
9.0.0-beta release this fall, while continuing to release 8.4, 8.5 etc 
afterwards. Lots of people would start using the alpha release, including 3rd 
party plugin developers, and we'd get tons of feedback and time to adjust and 
stabilize the APIs. The 9.0.0-beta release could also be a public beta for the 
Gradle build which is another thing users (and integrators not the least) needs 
to adjust to.

> A package management system for Solr
> 

[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-09-26 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938976#comment-16938976
 ] 

Mark Miller commented on SOLR-13796:


Here is a sample of the knobs we have currently; I'm sure there are others. We 
want to get tight control of these so we can have fast, consistent runs, as 
well as runs not meant to be developer-interactive (Nightly).
{noformat}
System.setProperty("solr.tests.IntegerFieldType", "solr.IntPointField");
System.setProperty("solr.tests.FloatFieldType", "solr.FloatPointField");
System.setProperty("solr.tests.LongFieldType", "solr.LongPointField");
System.setProperty("solr.tests.DoubleFieldType", "solr.DoublePointField");
System.setProperty("solr.tests.DateFieldType", "solr.DatePointField");
System.setProperty("solr.tests.EnumFieldType", "solr.EnumFieldType");
System.setProperty("solr.tests.numeric.dv", "true");
System.setProperty("solr.tests.numeric.points", "true");
System.setProperty("solr.tests.numeric.points.dv", "true");
System.setProperty("solr.iterativeMergeExecIdleTime", "1000");
System.setProperty("zookeeper.forceSync", "false");
System.setProperty("solr.zkclienttimeout", "9"); 
System.setProperty("solr.httpclient.retries", "1");
System.setProperty("solr.retries.on.forward", "1");
System.setProperty("solr.retries.to.followers", "1");
System.setProperty("solr.v2RealPath", "true");
System.setProperty("zookeeper.forceSync", "no");
System.setProperty("jetty.testMode", "true");
System.setProperty("enable.update.log", usually() ? "true" : "false");
System.setProperty("tests.shardhandler.randomSeed", 
Long.toString(random().nextLong()));
System.setProperty("solr.clustering.enabled", "false");
System.setProperty("solr.peerSync.useRangeVersions", 
String.valueOf(random().nextBoolean()));
System.setProperty("solr.cloud.wait-for-updates-with-stale-state-pause", "500");
System.setProperty(ZK_WHITELIST_PROPERTY, "*");
DirectUpdateHandler2.commitOnClose = false; // other tests turn this off and 
try to reset it - we use sys prop below to override
System.setProperty("tests.disableHdfs", "true");
System.setProperty("solr.maxContainerThreads", "20");
System.setProperty("solr.lowContainerThreadsThreshold", "-1");
System.setProperty("solr.minContainerThreads", "0");
System.setProperty("solr.containerThreadsIdle", "3");
System.setProperty("evictIdleConnections", "2");
System.setProperty("solr.commitOnClose", "false"); // can make things quite slow
System.setProperty("solr.codec", "solr.SchemaCodecFactory");
System.setProperty("tests.COMPRESSION_MODE", "BEST_COMPRESSION");
System.setProperty("tests.skipSetupCodec", "true");
System.setProperty("solr.lock.type", "single");
System.setProperty("solr.tests.lockType", "single");
System.setProperty("solr.tests.mergePolicyFactory", 
"org.apache.solr.index.NoMergePolicyFactory");
System.setProperty("solr.tests.mergeScheduler", 
"org.apache.lucene.index.ConcurrentMergeScheduler");
System.setProperty("solr.mscheduler", 
"org.apache.lucene.index.ConcurrentMergeScheduler");
System.setProperty("bucketVersionLockTimeoutMs", "8000");
System.setProperty("socketTimeout", "3");
System.setProperty("connTimeout", "1");
System.setProperty("solr.cloud.wait-for-updates-with-stale-state-pause", "0");
System.setProperty("solr.cloud.starting-recovery-delay-milli-seconds", "0");
System.setProperty("lucene.cms.override_core_count", "2");
System.setProperty("lucene.cms.override_spins", "false");
System.setProperty("solr.tests.maxBufferedDocs", "100");
System.setProperty("solr.tests.ramBufferSizeMB", "20");
System.setProperty("solr.tests.ramPerThreadHardLimitMB", "4");
System.setProperty("managed.schema.mutable", "false");
System.setProperty("solr.disableJvmMetrics", "true");
System.setProperty("useCompoundFile", "false");
System.setProperty("prepRecoveryReadTimeoutExtraWait", "2000");
System.setProperty("evictIdleConnections", "3");
System.setProperty("validateAfterInactivity", "-1");
System.setProperty("leaderVoteWait", "1000");
System.setProperty("leaderConflictResolveWait", "1000");
System.setProperty("solr.recovery.recoveryThrottle", "1000");
System.setProperty("solr.recovery.leaderThrottle", "500");
System.setProperty("solr.cloud.wait-for-updates-with-stale-state-pause", "0");
System.setProperty("solr.httpclient.retries", "1");
System.setProperty("solr.retries.on.forward", "1");
System.setProperty("solr.retries.to.followers", "1"); 
useFactory("solr.RAMDirectoryFactory");
{noformat}

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 

[jira] [Commented] (SOLR-13747) 'ant test' should fail on JVMs w/known SSL bugs

2019-09-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938974#comment-16938974
 ] 

ASF subversion and git services commented on SOLR-13747:


Commit ec9780c8aad7ffbf394d4cbefa772c6ba61650d0 in lucene-solr's branch 
refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ec9780c ]

SOLR-13747: New 
TestSSLTestConfig.testFailIfUserRunsTestsWithJVMThatHasKnownSSLBugs() to give 
people running tests more visibility if/when they use a known-buggy JVM causing 
most SSL tests to silently SKIP


> 'ant test' should fail on JVMs w/known SSL bugs
> 
>
> Key: SOLR-13747
> URL: https://issues.apache.org/jira/browse/SOLR-13747
> Project: Solr
>  Issue Type: Test
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-13747.patch
>
>
> If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has 
> known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this 
> -- so the user knows they should upgrade their JVM (rather than relying on 
> the user to notice that SSL tests were {{SKIP}}ed)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-09-26 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938973#comment-16938973
 ] 

Mark Miller commented on SOLR-13796:


Another thing I have to wrap up, but which is pretty well covered, is 
collecting all of our system properties into the common base class. You cannot 
tell what is set now or what can be set; stuff is littered everywhere.

I'd like to try and consolidate everything important in the base class, 
setting them all efficiently for non-Nightly runs and more randomly and 
thoroughly for Nightly runs.

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build on it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that have our tests 
> often take dozens of seconds to minutes when they should take mere seconds, 
> or ten at most.
> As part of this issue, I would like to move the focus of non-nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into nightly tests. Not 
> something that happens now, but if we have solid, fast, consistent 
> non-Nightly tests, that should open up some room for Nightly to get some 
> status boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399

2019-09-26 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938955#comment-16938955
 ] 

Erick Erickson commented on LUCENE-8991:


Anything that improves test stability is welcome, of course.

 

May I ask that you add the fact that it's been fixed in Java 12 to the comment 
in the code? Just so that, 3 years from now, when someone looks at it after we 
go to JDK 12+ as a minimum requirement, it's obvious that it can be removed.

> disable java.util.HashMap assertions to avoid spurious failures due to 
> JDK-8205399
> --
>
> Key: LUCENE-8991
> URL: https://issues.apache.org/jira/browse/LUCENE-8991
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Priority: Major
>  Labels: Java10, Java11
> Attachments: LUCENE-8991.patch
>
>
> An incredibly common class of jenkins failures (at least in Solr tests) stems 
> from triggering assertion failures in java.util.HashMap -- evidently 
> triggering bug JDK-8205399, first introduced in java-10 and fixed in 
> java-12, but never backported to any java-10 or java-11 bug fix 
> release...
>https://bugs.openjdk.java.net/browse/JDK-8205399
> SOLR-13653 tracks how this bug can affect Solr users, but I think it would 
> make sense to disable java.util.HashMap assertions in our build system to 
> reduce the confusing failures when users/jenkins run tests, since there is 
> nothing we can do to work around this when testing with java-11 (or java-10 
> on branch_8x)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-09-26 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938948#comment-16938948
 ] 

Mark Miller commented on SOLR-13796:


So closing things has been a problem for a few reasons, especially when we have 
a lot of items to close.
 * Often the logic of what exceptions to ignore and ensuring proper flow 
through a close is complicated and buggy.
 * Any object in our complicated graph taking a long time to close can greatly 
affect other components * N and really slow down tests.
 * Having to wait for slow closes hides bugs and problems and monsters.
 * Often closes are implemented inefficiently, allowing greater time for ugly 
slow interactions to kick off while the system is in a partially closed state.

In trying to solve this issue I've created a new SmartClose class; you use it 
like:

{noformat}
try (SmartClose closer = new SmartClose(this)) {
  closer.add(object1);
  closer.add("Group1", object2, executor1);
  closer.add("Group2", object3, () -> { weirdObject.shutdown(); return weirdObject; });
}
{noformat}
 
When the SmartClose object closes, it will close the objects within each 
'add' work group in parallel, while the groups themselves are closed in order.

So you don't have to handle much null logic (pass objects straight to add, 
null or not), you don't have to handle exception logic, and you don't have to 
handle efficiency logic. You just specify what has to be closed or shut down 
and what can be done in parallel or has to be done in order (each add call), 
and the rest is handled for you.

You also get nice automatic tracking, so you know how long closes are taking 
and what parts of them are slow, e.g.:

{noformat}
org.apache.solr.core.CoreContainer 155ms
  :  ZkContainer 44ms
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor 0ms
org.apache.solr.cloud.ZkController 0ms
  :  workExecutor & replayUpdateExec 1ms
org.apache.solr.common.util.OrderedExecutor 0ms
com.codahale.metrics.InstrumentedExecutorService 0ms
  :  MetricsHistory & WaitForSolrCores 93ms
org.apache.solr.core.SolrCores 93ms
org.apache.solr.handler.admin.MetricsHistoryHandler 0ms
org.apache.solr.client.solrj.impl.CloudSolrClient 0ms
  :  Metrics reporters & guages 11ms
org.apache.solr.metrics.SolrMetricManager:REP:NODE 11ms
org.apache.solr.metrics.SolrMetricManager:REP:JVM 2ms
org.apache.solr.metrics.SolrMetricManager:REP:JETTY 3ms
org.apache.solr.metrics.SolrMetricManager:GA:JVM 0ms
org.apache.solr.metrics.SolrMetricManager:GA:NODE 3ms
org.apache.solr.metrics.SolrMetricManager:GA:JETTY 0ms
  :  Final Items 2ms
org.apache.solr.handler.component.HttpShardHandlerFactory 0ms
org.apache.solr.update.UpdateShardHandler 0ms
org.apache.solr.core.SolrResourceLoader 0ms
org.apache.solr.metrics.SolrMetricManager:REP:CLUSTER 0ms
org.apache.solr.handler.admin.CoreAdminHandler 0ms
{noformat}

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build on it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that have our tests 
> often take dozens of seconds to minutes when they should take mere seconds, 
> or ten at most.
> As part of this issue, I would like to move the focus of non-nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into nightly tests. Not 
> something that happens now, but if we have solid, fast, consistent 
> non-Nightly tests, that should open up some room for Nightly to get some 
> status boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938935#comment-16938935
 ] 

Michael Sokolov commented on LUCENE-8920:
-

> Here is a proposal for the heuristic to select the encoding of a FST node.

I like the overall structure of the proposal. I'm unsure about the proposed 
levels. For example, I believe the current FST does not have D1=0.66; rather, 
it is 0.25. I'm not saying that's the _right_ choice, merely that I think it 
is what we have on master, and the discrepancy here makes me wonder if I'm 
reading the proposal correctly. How did you come up with the 0.66 number?

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes), 
> which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938881#comment-16938881
 ] 

Ishan Chattopadhyaya commented on SOLR-13661:
-

bq. Let's understand that this is not our hobby. It's a job. We are all able 
to do this because somebody is funding the development. When somebody is 
funding the development, they will have certain requirements and all their 
requirements need to be met. I'm sure every org works like that. Salesforce, 
Bloomberg, Apple, Lucidworks and all the big contributors are building big 
features to satisfy their businesses.

I disagree with the spirit of this comment. I'd have done this work 
irrespective of someone funding this development. For me, it is as much a 
hobby as a job.

> A package management system for Solr
> 
>
> Key: SOLR-13661
> URL: https://issues.apache.org/jira/browse/SOLR-13661
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Ishan Chattopadhyaya
>Priority: Major
>  Labels: package
> Attachments: plugin-usage.png, repos.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Here's the design doc:
> https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938863#comment-16938863
 ] 

Ishan Chattopadhyaya edited comment on SOLR-13661 at 9/26/19 6:55 PM:
--

1.
{quote}too many decisions seem to be made with too few eyes 
{quote}
Noble, Ishan, Jan, Andrzej, David, Erick. We've had at least this many "eyes". 
How many more, or who else, are needed?

2.
{quote}"package" concept seems to be designed for ONE use case only, customer's 
internal custom packages, with arbitrary local naming of repos and packages
{quote}
Not at all. Apache's contrib modules can easily be installed/deployed the same 
way; they can be packages themselves. At the end of the day, a "package" is 
just a jar and some metadata. A repository is just a named location containing 
packages. There is no "arbitrary local naming of repos and packages"; those 
aspects are left to the plugin writer (which can be Apache as well, in the 
case of official contrib modules).
{quote}before such a feature goes mainstream, the design should also include 
converting some of our contrib modules to packages that we release as separate 
binaries in the mirrors, and enable an "apache" Repo as default.
{quote}
Converting a contrib module needn't be a precursor to releasing this feature. 
Moreover, without the Gradle work completed, I don't want to attempt to change 
the build system too much (only to do it again later). The system design 
supports us doing so, and it can be taken up later.
{quote}Perhaps that would mean some name spacing or name collision resolution
{quote}
Namespacing the packages is something we thought of. We already have two pieces 
of information available in public APIs and all system boundaries/interfaces: 
"repository" and "package-name". If we need to internally namespace the 
packages, we can do so later. However, I don't think we need to do this: look 
at dnf, apt and similar systems; there's no concept of package namespaces. 
That means third parties should be careful enough not to name their packages 
the same as official packages.

3. Sure.
{quote}We need a plan for how 3rd party plugin developers can publish their 
plugins on their own web site or on GitHub in a well defined way.
{quote}
I can add a document to that effect in our ref-guide. Initially, just the 
repository structure documented in the design document will be supported. 
Github support can be added subsequently.

4.
{quote}Hot/Cold deploy. I don't like systems where you, as part of the install 
need to spin up a server.
{quote}
Spinning up a cluster prior to installing the packages is not "needed". Someone 
can cold start a cluster with plugins pre-installed. Noble and I have both 
documented those steps in the design doc as a reply to your comment.

6. That's a tradeoff. I initially raised the same point, but Noble suggested 
that adding a new set of znode watchers per collection is an overhead. I'm +0 
on your suggestion.

7. We have a disagreement here. Needing the user to edit solrconfig.xml 
necessarily in order to specify which collection uses which package is bad from 
simplicity standpoint. Please keep in mind that hand editing configset is an 
expert feature.

8. No, we're not forcing regular users to do anything extra. Regular users 
should be using config APIs to register/deregister their plugins. Expert users 
are expert enough to just add an extra package name while they are hand editing 
their solrconfig.xml. I think this is a very reasonable compromise.

9. Plan is to support both: (a) jar without manifest + external manifest, (b) 
jar containing manifest.
 The first scenario should be preferred in case of multi-jar packages.

10. Sure, I like the idea of supporting multiple jars as well as a zip 
containing all the jars. It can be supported.

11. Plugin initialisation commands are not complex at all; they are typically 
just regular config API commands to register the plugins. They are necessary 
for pleasant adoption. A user just needs to install a package and specify which 
collection he needs to deploy his package to. Simple! If we go by your idea, 
the user would install the package but would need to hand-edit solrconfig.xml 
in order to add the plugins from the package to his collection.

12. I agree. We thought a lot about how to support filenames, and all the 
approaches seemed to have some deficiency. I've documented that as a comment in 
the design document. Without an additional distributed KV datastore, this seems 
hard to get right. The sha256.properties file approach will not work well, 
since every node will have its own version of the file, and also maintaining 
update consistency will be hard.
{quote}But I put a lot of careful thought into the POC which I feel is largely 
lacking here
{quote}
I assure you that we've put a lot of thought into how we designed this. We went 
through deployment lifecycles of at 

[jira] [Commented] (SOLR-13105) A visual guide to Solr Math Expressions and Streaming Expressions

2019-09-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938877#comment-16938877
 ] 

ASF subversion and git services commented on SOLR-13105:


Commit 17b2308a17532202b62fee4234f4ed05703870e8 in lucene-solr's branch 
refs/heads/SOLR-13105-visual from Joel Bernstein
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=17b2308 ]

SOLR-13105: Update dsp docs 3


> A visual guide to Solr Math Expressions and Streaming Expressions
> -
>
> Key: SOLR-13105
> URL: https://issues.apache.org/jira/browse/SOLR-13105
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Attachments: Screen Shot 2019-01-14 at 10.56.32 AM.png, Screen Shot 
> 2019-02-21 at 2.14.43 PM.png, Screen Shot 2019-03-03 at 2.28.35 PM.png, 
> Screen Shot 2019-03-04 at 7.47.57 PM.png, Screen Shot 2019-03-13 at 10.47.47 
> AM.png, Screen Shot 2019-03-30 at 6.17.04 PM.png
>
>
> Visualization is now a fundamental element of Solr Streaming Expressions and 
> Math Expressions. This ticket will create a visual guide to Solr Math 
> Expressions and Solr Streaming Expressions that includes *Apache Zeppelin* 
> visualization examples.
> It will also cover using the JDBC expression to *analyze* and *visualize* 
> results from any JDBC compliant data source.
> Intro from the guide:
> {code:java}
> Streaming Expressions exposes the capabilities of Solr Cloud as composable 
> functions. These functions provide a system for searching, transforming, 
> analyzing and visualizing data stored in Solr Cloud collections.
> At a high level there are four main capabilities that will be explored in the 
> documentation:
> * Searching, sampling and aggregating results from Solr.
> * Transforming result sets after they are retrieved from Solr.
> * Analyzing and modeling result sets using probability and statistics and 
> machine learning libraries.
> * Visualizing result sets, aggregations and statistical models of the data.
> {code}
>  
> A few sample visualizations are attached to the ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13105) A visual guide to Solr Math Expressions and Streaming Expressions

2019-09-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938876#comment-16938876
 ] 

ASF subversion and git services commented on SOLR-13105:


Commit cc635233a614f62845f6361afbf9edb102bf3a04 in lucene-solr's branch 
refs/heads/SOLR-13105-visual from Joel Bernstein
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=cc63523 ]

SOLR-13105: Update dsp docs 2


> A visual guide to Solr Math Expressions and Streaming Expressions
> -
>
> Key: SOLR-13105
> URL: https://issues.apache.org/jira/browse/SOLR-13105
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Attachments: Screen Shot 2019-01-14 at 10.56.32 AM.png, Screen Shot 
> 2019-02-21 at 2.14.43 PM.png, Screen Shot 2019-03-03 at 2.28.35 PM.png, 
> Screen Shot 2019-03-04 at 7.47.57 PM.png, Screen Shot 2019-03-13 at 10.47.47 
> AM.png, Screen Shot 2019-03-30 at 6.17.04 PM.png
>
>
> Visualization is now a fundamental element of Solr Streaming Expressions and 
> Math Expressions. This ticket will create a visual guide to Solr Math 
> Expressions and Solr Streaming Expressions that includes *Apache Zeppelin* 
> visualization examples.
> It will also cover using the JDBC expression to *analyze* and *visualize* 
> results from any JDBC compliant data source.
> Intro from the guide:
> {code:java}
> Streaming Expressions exposes the capabilities of Solr Cloud as composable 
> functions. These functions provide a system for searching, transforming, 
> analyzing and visualizing data stored in Solr Cloud collections.
> At a high level there are four main capabilities that will be explored in the 
> documentation:
> * Searching, sampling and aggregating results from Solr.
> * Transforming result sets after they are retrieved from Solr.
> * Analyzing and modeling result sets using probability and statistics and 
> machine learning libraries.
> * Visualizing result sets, aggregations and statistical models of the data.
> {code}
>  
> A few sample visualizations are attached to the ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13105) A visual guide to Solr Math Expressions and Streaming Expressions

2019-09-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938872#comment-16938872
 ] 

ASF subversion and git services commented on SOLR-13105:


Commit e1feb24c5e4a189b7c1cbbc2e2ee0523891dbe6f in lucene-solr's branch 
refs/heads/SOLR-13105-visual from Joel Bernstein
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e1feb24 ]

SOLR-13105: Update dsp docs


> A visual guide to Solr Math Expressions and Streaming Expressions
> -
>
> Key: SOLR-13105
> URL: https://issues.apache.org/jira/browse/SOLR-13105
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Attachments: Screen Shot 2019-01-14 at 10.56.32 AM.png, Screen Shot 
> 2019-02-21 at 2.14.43 PM.png, Screen Shot 2019-03-03 at 2.28.35 PM.png, 
> Screen Shot 2019-03-04 at 7.47.57 PM.png, Screen Shot 2019-03-13 at 10.47.47 
> AM.png, Screen Shot 2019-03-30 at 6.17.04 PM.png
>
>
> Visualization is now a fundamental element of Solr Streaming Expressions and 
> Math Expressions. This ticket will create a visual guide to Solr Math 
> Expressions and Solr Streaming Expressions that includes *Apache Zeppelin* 
> visualization examples.
> It will also cover using the JDBC expression to *analyze* and *visualize* 
> results from any JDBC compliant data source.
> Intro from the guide:
> {code:java}
> Streaming Expressions exposes the capabilities of Solr Cloud as composable 
> functions. These functions provide a system for searching, transforming, 
> analyzing and visualizing data stored in Solr Cloud collections.
> At a high level there are four main capabilities that will be explored in the 
> documentation:
> * Searching, sampling and aggregating results from Solr.
> * Transforming result sets after they are retrieved from Solr.
> * Analyzing and modeling result sets using probability and statistics and 
> machine learning libraries.
> * Visualizing result sets, aggregations and statistical models of the data.
> {code}
>  
> A few sample visualizations are attached to the ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938870#comment-16938870
 ] 

Ishan Chattopadhyaya commented on SOLR-13661:
-

bq. We went through deployment lifecycles of at least 3 of my clients, went 
through state-of-the-art package management systems like apt, dnf etc., and 
thought through all potential use cases and the future of Solr as a lean core 
(with all non-essential features stripped out as packages in an Apache 
repository), as well as went through your talk+code+presentation+document. Our 
main focus is ease of use and a robust plugin lifecycle management experience. 
Your PoC was lacking some major pieces that we've covered meticulously here: 
security, efficient loading (without needing to restart nodes as in your PoC), 
ease of deployment, not requiring packages to depend on PF4J (your PoC forced 
users to add PF4J as a dependency, perhaps to facilitate the loading) etc.

Also, I went through some bits of the user experience of the ES plugin system. 
[~jpountz], [~jim.ferenczi], [~mikemccand], would love your thoughts on this, 
based on your experience with ES.

> A package management system for Solr
> 
>
> Key: SOLR-13661
> URL: https://issues.apache.org/jira/browse/SOLR-13661
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Ishan Chattopadhyaya
>Priority: Major
>  Labels: package
> Attachments: plugin-usage.png, repos.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Here's the design doc:
> https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938863#comment-16938863
 ] 

Ishan Chattopadhyaya edited comment on SOLR-13661 at 9/26/19 6:28 PM:
--

1.
{quote}too many decisions seem to be made with too few eyes 
{quote}
Noble, Ishan, Jan, Andrzej, David. We've had at least this many "eyes". How 
many more, or who else, are needed?

2.
{quote}"package" concept seems to be designed for ONE use case only, customer's 
internal custom packages, with arbitrary local naming of repos and packages
{quote}
Not at all. Apache's contrib modules can easily be installed/deployed the same 
way, they can be packages themselves. End of the day, a "package" is just a jar 
and some metadata. A repository is just a named location containing packages. 
There is no "arbitrary local naming of repos and packages", those aspects are 
left to the plugin writer (which can be Apache as well in case of official 
contrib modules).
{quote}before such a feature goes mainstream, the design should also include 
converting some of our contrib modules to packages that we release as separate 
binaries in the mirrors, and enable an "apache" Repo as default.
{quote}
Converting a contrib module needn't be a precursor to releasing this feature. 
Moreover, without the Gradle work completed, I don't want to attempt to change 
the build system too much (only to do it again later). The system design 
supports us doing so, and it can be taken up later.
{quote}Perhaps that would mean some name spacing or name collision resolution
{quote}
Namespacing the packages is something we thought of. We already have two pieces 
of information available in public APIs and all system boundaries/interfaces: 
"repository" and "package-name". If we need to internally namespace the 
packages, we can do so later. However, I don't think we need to do this: look 
at systems like dnf, apt etc., where there's no concept of package namespaces. 
That means third parties should be careful enough not to name their packages 
the same as official packages.

3. Sure.
{quote}We need a plan for how 3rd party plugin developers can publish their 
plugins on their own web site or on GitHub in a well defined way.
{quote}
I can add a document to that effect in our ref-guide. Initially, just the 
repository structure documented in the design document will be supported. 
Github support can be added subsequently.

4.
{quote}Hot/Cold deploy. I don't like systems where you, as part of the install 
need to spin up a server.
{quote}
Spinning up a cluster prior to installing the packages is not "needed". Someone 
can cold start a cluster with plugins pre-installed. Noble and I have both 
documented those steps in the design doc as a reply to your comment.

6. That's a tradeoff. I initially raised the same point, but Noble suggested 
that adding a new set of znode watchers per collection is an overhead. I'm +0 
on your suggestion.

7. We have a disagreement here. Requiring the user to edit solrconfig.xml in 
order to specify which collection uses which package is bad from a simplicity 
standpoint. Please keep in mind that hand-editing a configset is an expert 
feature.

8. No, we're not forcing regular users to do anything extra. Regular users 
should be using config APIs to register/deregister their plugins. Expert users 
are expert enough to just add an extra package name while they are hand-editing 
their solrconfig.xml. I think this is a very reasonable compromise.

9. Plan is to support both: (a) jar without manifest + external manifest, (b) 
jar containing manifest.
 The first scenario should be preferred in case of multi-jar packages.

10. Sure, I like the idea of supporting multiple jars as well as a zip 
containing all the jars. It can be supported.

11. Plugin initialisation commands are not complex at all; they are typically 
just regular config API commands to register the plugins. They are necessary 
for pleasant adoption. A user just needs to install a package and specify which 
collection he needs to deploy his package to. Simple! If we go by your idea, 
then the user would install the package but would need to hand-edit 
solrconfig.xml in order to add the plugins from the package to his collection.

12. I agree. We thought a lot about how to support filenames, and all the 
approaches seemed to have some deficiency. I've documented that as a comment in 
the design document. Without an additional distributed KV datastore, this seems 
hard to get right. The sha256.properties file approach will not work well, 
since every node will have its own version of the file, and also maintaining 
update consistency will be hard.
{quote}But I put a lot of careful thought into the POC which I feel is largely 
lacking here
{quote}
I assure you that we've put a lot of thought into how we designed this. We went 
through deployment lifecycles of at least 3 

[jira] [Comment Edited] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938863#comment-16938863
 ] 

Ishan Chattopadhyaya edited comment on SOLR-13661 at 9/26/19 6:24 PM:
--

1.
{quote}too many decisions seem to be made with too few eyes 
{quote}
Noble, Ishan, Jan, Andrzej, David. We've had at least this many "eyes". How 
many more, or who else, are needed?

2.
{quote}"package" concept seems to be designed for ONE use case only, customer's 
internal custom packages, with arbitrary local naming of repos and packages
{quote}
Not at all. Apache's contrib modules can easily be installed/deployed the same 
way, they can be packages themselves. End of the day, a "package" is just a jar 
and some metadata. A repository is just a named location containing packages. 
There is no "arbitrary local naming of repos and packages", those aspects are 
left to the plugin writer (which can be Apache as well in case of official 
contrib modules).
{quote}before such a feature goes mainstream, the design should also include 
converting some of our contrib modules to packages that we release as separate 
binaries in the mirrors, and enable an "apache" Repo as default.
{quote}
Converting a contrib module needn't be a precursor to releasing this feature. 
Moreover, without the Gradle work completed, I don't want to attempt to change 
the build system too much (only to do it again later). The system design 
supports us doing so, and it can be taken up later.
{quote}Perhaps that would mean some name spacing or name collision resolution
{quote}
Namespacing the packages is something we thought of. We already have two pieces 
of information available in public APIs and all system boundaries/interfaces: 
"repository" and "package-name". If we need to internally namespace the 
packages, we can do so later. However, I don't think we need to do this: look 
at systems like dnf, apt etc., where there's no concept of package namespaces. 
That means third parties should be careful enough not to name their packages 
the same as official packages.

3. Sure.
{quote}We need a plan for how 3rd party plugin developers can publish their 
plugins on their own web site or on GitHub in a well defined way.
{quote}
I can add a document to that effect in our ref-guide. Initially, just the 
repository structure documented in the design document will be supported. 
Github support can be added subsequently.

4.
{quote}Hot/Cold deploy. I don't like systems where you, as part of the install 
need to spin up a server.
{quote}
Spinning up a cluster prior to installing the packages is not "needed". Someone 
can cold start a cluster with plugins pre-installed. Noble and I have both 
documented those steps in the design doc as a reply to your comment.

6. That's a tradeoff. I initially raised the same point, but Noble suggested 
that adding a new set of znode watchers per collection is an overhead. I'm +0 
on your suggestion.

7. We have a disagreement here. Requiring the user to edit solrconfig.xml in 
order to specify which collection uses which package is bad from a simplicity 
standpoint. Please keep in mind that hand-editing a configset is an expert 
feature.

8. No, we're not forcing regular users to do anything extra. Regular users 
should be using config APIs to register/deregister their plugins. Expert users 
are expert enough to just add an extra package name while they are hand-editing 
their solrconfig.xml. I think this is a very reasonable compromise.

9. Plan is to support both: (a) jar without manifest + external manifest, (b) 
jar containing manifest.
 The first scenario should be preferred in case of multi-jar packages.

10. Sure, I like the idea of supporting multiple jars as well as a zip 
containing all the jars. It can be supported.

11. Plugin initialisation commands are not complex at all; they are typically 
just regular config API commands to register the plugins. They are necessary 
for pleasant adoption. A user just needs to install a package and specify which 
collection he needs to deploy his package to. Simple! If we go by your idea, 
then the user would install the package but would need to hand-edit 
solrconfig.xml in order to add the plugins from the package to his collection.

12. I agree. We thought a lot about how to support filenames, and all the 
approaches seemed to have some deficiency. I've documented that as a comment in 
the design document. Without an additional distributed KV datastore, this seems 
hard to get right. The sha256.properties file approach will not work well, 
since every node will have its own version of the file, and also maintaining 
update consistency will be hard.
{quote}But I put a lot of careful thought into the POC which I feel is largely 
lacking here
{quote}
I assure you that we've put a lot of thought into how we designed this. Our 
main focus is ease of use and a robust plugin 

[jira] [Comment Edited] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938863#comment-16938863
 ] 

Ishan Chattopadhyaya edited comment on SOLR-13661 at 9/26/19 6:24 PM:
--

1.
{quote}too many decisions seem to be made with too few eyes 
{quote}
Noble, Ishan, Jan, Andrzej, David. We've had at least this many "eyes". How 
many more, or who else, are needed?

2.
{quote}"package" concept seems to be designed for ONE use case only, customer's 
internal custom packages, with arbitrary local naming of repos and packages
{quote}
Not at all. Apache's contrib modules can easily be installed/deployed the same 
way, they can be packages themselves. End of the day, a "package" is just a jar 
and some metadata. A repository is just a named location containing packages. 
There is no "arbitrary local naming of repos and packages", those aspects are 
left to the plugin writer (which can be Apache as well in case of official 
contrib modules).
{quote}before such a feature goes mainstream, the design should also include 
converting some of our contrib modules to packages that we release as separate 
binaries in the mirrors, and enable an "apache" Repo as default.
{quote}
Converting a contrib module needn't be a precursor to releasing this feature. 
Moreover, without the Gradle work completed, I don't want to attempt to change 
the build system too much (only to do it again later).
{quote}Perhaps that would mean some name spacing or name collision resolution
{quote}
Namespacing the packages is something we thought of. We already have two pieces 
of information available in public APIs and all system boundaries/interfaces: 
"repository" and "package-name". If we need to internally namespace the 
packages, we can do so later. However, I don't think we need to do this: look 
at systems like dnf, apt etc., where there's no concept of package namespaces. 
That means third parties should be careful enough not to name their packages 
the same as official packages.

3. Sure.
{quote}We need a plan for how 3rd party plugin developers can publish their 
plugins on their own web site or on GitHub in a well defined way.
{quote}
I can add a document to that effect in our ref-guide. Initially, just the 
repository structure documented in the design document will be supported. 
Github support can be added subsequently.

4.
{quote}Hot/Cold deploy. I don't like systems where you, as part of the install 
need to spin up a server.
{quote}
Spinning up a cluster prior to installing the packages is not "needed". Someone 
can cold start a cluster with plugins pre-installed. Noble and I have both 
documented those steps in the design doc as a reply to your comment.

6. That's a tradeoff. I initially raised the same point, but Noble suggested 
that adding a new set of znode watchers per collection is an overhead. I'm +0 
on your suggestion.

7. We have a disagreement here. Requiring the user to edit solrconfig.xml in 
order to specify which collection uses which package is bad from a simplicity 
standpoint. Please keep in mind that hand-editing a configset is an expert 
feature.

8. No, we're not forcing regular users to do anything extra. Regular users 
should be using config APIs to register/deregister their plugins. Expert users 
are expert enough to just add an extra package name while they are hand-editing 
their solrconfig.xml. I think this is a very reasonable compromise.

9. Plan is to support both: (a) jar without manifest + external manifest, (b) 
jar containing manifest.
 The first scenario should be preferred in case of multi-jar packages.

10. Sure, I like the idea of supporting multiple jars as well as a zip 
containing all the jars. It can be supported.

11. Plugin initialisation commands are not complex at all; they are typically 
just regular config API commands to register the plugins. They are necessary 
for pleasant adoption. A user just needs to install a package and specify which 
collection he needs to deploy his package to. Simple! If we go by your idea, 
then the user would install the package but would need to hand-edit 
solrconfig.xml in order to add the plugins from the package to his collection.

12. I agree. We thought a lot about how to support filenames, and all the 
approaches seemed to have some deficiency. I've documented that as a comment in 
the design document. Without an additional distributed KV datastore, this seems 
hard to get right. The sha256.properties file approach will not work well, 
since every node will have its own version of the file, and also maintaining 
update consistency will be hard.
{quote}But I put a lot of careful thought into the POC which I feel is largely 
lacking here
{quote}
I assure you that we've put a lot of thought into how we designed this. Our 
main focus is ease of use and a robust plugin lifecycle management experience. 
Your PoC was lacking some major pieces 

[GitHub] [lucene-solr] diegoceccarelli commented on issue #300: SOLR-11831: Skip second grouping step if group.limit is 1 (aka Las Vegas Patch)

2019-09-26 Thread GitBox
diegoceccarelli commented on issue #300: SOLR-11831: Skip second grouping step 
if group.limit is 1 (aka Las Vegas Patch)
URL: https://github.com/apache/lucene-solr/pull/300#issuecomment-535627929
 
 
   Hi @cpoerschke! I think I addressed your comments, please let me know if I 
missed anything! 
   
   Overview of the changes:
   - I found an explanation for the mystery of the two failing tests and I 
fixed them.
   - Added checks for `numFound == 1`
   - Improved documentation 
   - Forbade `group.func` and `group.query`, and documented it. 
   - Fixed issues with `maxScore`, added tests. 
   
   I have fixed some issues with distributed `maxScore` that are not related 
to the patch but that needed fixing for the tests; I think I'm going to move 
them into a separate PR.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938863#comment-16938863
 ] 

Ishan Chattopadhyaya commented on SOLR-13661:
-

1.
{quote}too many decisions seem to be made with too few eyes 
{quote}
Noble, Ishan, Jan, Andrzej, David. We've had at least this many "eyes". How 
many more, or who else, are needed?

2.
{quote}"package" concept seems to be designed for ONE use case only, customer's 
internal custom packages, with arbitrary local naming of repos and packages
{quote}
Not at all. Apache's contrib modules can easily be installed/deployed the same 
way, they can be packages themselves.
{quote}before such a feature goes mainstream, the design should also include 
converting some of our contrib modules to packages that we release as separate 
binaries in the mirrors, and enable an "apache" Repo as default.
{quote}
Converting a contrib module needn't be a precursor to releasing this feature. 
Moreover, without the Gradle work completed, I don't want to attempt to change 
the build system too much (only to do it again later).
{quote}Perhaps that would mean some name spacing or name collision resolution
{quote}
Namespacing the packages is something we thought of. We already have two pieces 
of information available in public APIs and all system boundaries/interfaces: 
"repository" and "package-name". If we need to internally namespace the 
packages, we can do so later. However, I don't think we need to do this: look 
at systems like dnf, apt etc., where there's no concept of package namespaces. 
That means third parties should be careful enough not to name their packages 
the same as official packages.

3. Sure.
{quote}We need a plan for how 3rd party plugin developers can publish their 
plugins on their own web site or on GitHub in a well defined way.
{quote}
I can add a document to that effect in our ref-guide. Initially, just the 
repository structure documented in the design document will be supported. 
Github support can be added subsequently.

4.
{quote}Hot/Cold deploy. I don't like systems where you, as part of the install 
need to spin up a server.
{quote}
Spinning up a cluster prior to installing the packages is not "needed". Someone 
can cold start a cluster with plugins pre-installed. Noble and I have both 
documented those steps in the design doc as a reply to your comment.

6. That's a tradeoff. I initially raised the same point, but Noble suggested 
that adding a new set of znode watchers per collection is an overhead. I'm +0 
on your suggestion.

7. We have a disagreement here. Requiring the user to edit solrconfig.xml in 
order to specify which collection uses which package is bad from a simplicity 
standpoint. Please keep in mind that hand-editing a configset is an expert 
feature.

8. No, we're not forcing regular users to do anything extra. Regular users 
should be using config APIs to register/deregister their plugins. Expert users 
are expert enough to just add an extra package name while they are hand-editing 
their solrconfig.xml. I think this is a very reasonable compromise.

9. Plan is to support both: (a) jar without manifest + external manifest, (b) 
jar containing manifest.
 The first scenario should be preferred in case of multi-jar packages.

10. Sure, I like the idea of supporting multiple jars as well as a zip 
containing all the jars. It can be supported.

11. Plugin initialisation commands are not complex at all; they are typically 
just regular config API commands to register the plugins. They are necessary 
for pleasant adoption. A user just needs to install a package and specify which 
collection he needs to deploy his package to. Simple! If we go by your idea, 
then the user would install the package but would need to hand-edit 
solrconfig.xml in order to add the plugins from the package to his collection.

12. I agree. We thought a lot about how to support filenames, and all the 
approaches seemed to have some deficiency. I've documented that as a comment in 
the design document. Without an additional distributed KV datastore, this seems 
hard to get right. The sha256.properties file approach will not work well, 
since every node will have its own version of the file, and also maintaining 
update consistency will be hard.
{quote}But I put a lot of careful thought into the POC which I feel is largely 
lacking here
{quote}
I assure you that we've put a lot of thought into how we designed this. Our 
main focus is ease of use and a robust plugin lifecycle management experience. 
Your PoC was lacking some major pieces that we've covered meticulously here: 
security, efficient loading (without needing to restart nodes as in your PoC), 
ease of deployment, not requiring packages to depend on PF4J (your PoC forced 
users to add PF4J as a dependency, perhaps to facilitate the loading) etc.

Just because I haven't documented each and every user story that you 

[GitHub] [lucene-solr] diegoceccarelli commented on a change in pull request #300: SOLR-11831: Skip second grouping step if group.limit is 1 (aka Las Vegas Patch)

2019-09-26 Thread GitBox
diegoceccarelli commented on a change in pull request #300: SOLR-11831: Skip 
second grouping step if group.limit is 1 (aka Las Vegas Patch)
URL: https://github.com/apache/lucene-solr/pull/300#discussion_r328753104
 
 

 ##
 File path: 
lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java
 ##
 @@ -139,10 +139,18 @@ public ScoreMode scoreMode() {
   // System.out.println("  group=" + (group.groupValue == null ? "null" : 
group.groupValue.toString()));
   SearchGroup<T> searchGroup = new SearchGroup<>();
   searchGroup.groupValue = group.groupValue;
+  // We pass this around so that we can get the corresponding solr id when 
serializing the search group to send to the federator
+  searchGroup.topDocLuceneId = group.topDoc;
   searchGroup.sortValues = new Object[sortFieldCount];
   for(int sortFieldIDX=0;sortFieldIDX<sortFieldCount;sortFieldIDX++) {

[jira] [Updated] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399

2019-09-26 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-8991:
---
Status: Patch Available  (was: Open)

> disable java.util.HashMap assertions to avoid spurious failures due to 
> JDK-8205399
> --
>
> Key: LUCENE-8991
> URL: https://issues.apache.org/jira/browse/LUCENE-8991
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-8991.patch
>
>
> An incredibly common class of jenkins failure (at least in Solr tests) stems 
> from triggering assertion failures in java.util.HashMap -- evidently 
> triggering bug JDK-8205399, first introduced in java-10, and fixed in 
> java-12, but has never been backported to any java-10 or java-11 bug fix 
> release...
>https://bugs.openjdk.java.net/browse/JDK-8205399
> SOLR-13653 tracks how this bug can affect Solr users, but I think it would 
> make sense to disable java.util.HashMap assertions in our build system to 
> reduce the confusing failures when users/jenkins run tests, since there is 
> nothing we can do to work around this when testing with java-11 (or java-10 
> on branch_8x)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399

2019-09-26 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-8991:
---
Description: 
An incredibly common class of jenkins failure (at least in Solr tests) stems 
from triggering assertion failures in java.util.HashMap -- evidently triggering 
bug JDK-8205399, first introduced in java-10, and fixed in java-12, but has 
never been backported to any java-10 or java-11 bug fix release...

   https://bugs.openjdk.java.net/browse/JDK-8205399

SOLR-13653 tracks how this bug can affect Solr users, but I think it would make 
sense to disable java.util.HashMap assertions in our build system to reduce the 
confusing failures when users/jenkins run tests, since there is nothing we can 
do to work around this when testing with java-11 (or java-10 on branch_8x)


  was:

An incredibly common class of jenkins failure (at least in Solr tests) stems 
from triggering assertion failures in java.util.HashMap -- evidently triggering 
bug JDK-8205399, first introduced in java-10, and fixed in java-12, but has 
never been backported to any java-10 or java-11 bug fix release...

   https://bugs.openjdk.java.net/browse/JDK-8205399

SOLR-13653 tracks how this bug can affect Solr users, but I think it would make 
sense to disable java.util.HashMap assertions in our build system to reduce the 
confusing failures when users/jenkins run tests, since there is nothing we can 
do to work around this when testing with java-11 (or java-10 on branch_8x)


 Labels: Java10 Java11  (was: )

> disable java.util.HashMap assertions to avoid spurious failures due to 
> JDK-8205399
> --
>
> Key: LUCENE-8991
> URL: https://issues.apache.org/jira/browse/LUCENE-8991
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Priority: Major
>  Labels: Java10, Java11
> Attachments: LUCENE-8991.patch
>
>
> An incredibly common class of jenkins failure (at least in Solr tests) stems 
> from triggering assertion failures in java.util.HashMap -- evidently 
> triggering bug JDK-8205399, first introduced in java-10, and fixed in 
> java-12, but has never been backported to any java-10 or java-11 bug fix 
> release...
>https://bugs.openjdk.java.net/browse/JDK-8205399
> SOLR-13653 tracks how this bug can affect Solr users, but I think it would 
> make sense to disable java.util.HashMap assertions in our build system to 
> reduce the confusing failures when users/jenkins run tests, since there is 
> nothing we can do to work around this when testing with java-11 (or java-10 
> on branch_8x)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399

2019-09-26 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-8991:
---
Attachment: LUCENE-8991.patch
Status: Open  (was: Open)


FYI: I tried asking Rory about the open-jdk backporting process and why the fix 
for JDK-8205399 had never been backported for inclusion in 11.0.2 or 11.0.3 (or 
at this point 11.0.4) given how long the issue has been known relative to when 
those releases came out, but never got a response...

http://mail-archives.apache.org/mod_mbox/lucene-dev/201907.mbox/%3calpine.DEB.2.11.1907251029100.10893@tray%3e



An example of how this type of bug can manifest in our tests, from a recent 
jenkins failure...

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom 
-Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true 
-Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   [junit4] ERROR   13.4s J1 | TestCloudJSONFacetSKG.testRandom <<<
   [junit4]> Throwable #1: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at 
https://127.0.0.1:43673/solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection:
 Expected mime type application/octet-stream but got text/html. 
   [junit4]> 
   [junit4]> 
   [junit4]> Error 500 Server Error
   [junit4]> 
   [junit4]> HTTP ERROR 500
   [junit4]> Problem accessing 
/solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection/select. 
Reason:
   [junit4]> Server ErrorCaused 
by:java.lang.AssertionError
   [junit4]>at 
java.base/java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1896)
   [junit4]>at 
java.base/java.util.HashMap$TreeNode.putTreeVal(HashMap.java:2061)
   [junit4]>at java.base/java.util.HashMap.putVal(HashMap.java:633)
   [junit4]>at java.base/java.util.HashMap.put(HashMap.java:607)
   [junit4]>at 
org.apache.solr.search.LRUCache.putCacheValue(LRUCache.java:295)
   [junit4]>at 
org.apache.solr.search.LRUCache.put(LRUCache.java:268)
   [junit4]>at 
org.apache.solr.search.SolrCacheHolder.put(SolrCacheHolder.java:92)
{noformat}

TestCloudJSONFacetSKG seems to trigger this assertion bug a lot, but I've also 
seen it pop up in other tests.  I haven't really dug into the details of 
JDK-8205399, but I suspect the size/balancing/rebalancing of the HashMap is 
what tickles the affected code path, so (I guess) tests that result in largish 
HashMaps seem more likely to trigger it?



The attached patch is a simple change to lucene/common-build.xml to modify our 
existing use of {{"\-ea \-esa"}} into {{"\-ea \-esa \-da:java.util.HashMap"}}
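
As a quick illustration (a throwaway sketch, not part of the patch), a class 
like the following, run with {{\-ea \-esa \-da:java.util.HashMap}}, should 
report HashMap assertions off while assertions everywhere else stay on:

{code:java}
public class AssertionCheck {
  public static void main(String[] args) {
    // desiredAssertionStatus() reports the status the JVM would assign to a
    // class based on the -ea/-esa/-da flags it was launched with.
    boolean mapAsserts = java.util.HashMap.class.desiredAssertionStatus();
    boolean ownAsserts = AssertionCheck.class.desiredAssertionStatus();
    System.out.println("java.util.HashMap assertions: " + (mapAsserts ? "on" : "off"));
    System.out.println("AssertionCheck assertions: " + (ownAsserts ? "on" : "off"));
  }
}
{code}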

Any objections?


> disable java.util.HashMap assertions to avoid spurious failures due to 
> JDK-8205399
> --
>
> Key: LUCENE-8991
> URL: https://issues.apache.org/jira/browse/LUCENE-8991
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-8991.patch
>
>
> An incredibly common class of jenkins failure (at least in Solr tests) stems 
> from triggering assertion failures in java.util.HashMap -- evidently 
> triggering bug JDK-8205399, first introduced in java-10, and fixed in 
> java-12, but has never been backported to any java-10 or java-11 bug fix 
> release...
>https://bugs.openjdk.java.net/browse/JDK-8205399
> SOLR-13653 tracks how this bug can affect Solr users, but I think it would 
> make sense to disable java.util.HashMap assertions in our build system to 
> reduce the confusing failures when users/jenkins run tests, since there is 
> nothing we can do to work around this when testing with java-11 (or java-10 
> on branch_8x)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-8991) disable java.util.HashMap assertions to avoid spurious failures due to JDK-8205399

2019-09-26 Thread Chris M. Hostetter (Jira)
Chris M. Hostetter created LUCENE-8991:
--

 Summary: disable java.util.HashMap assertions to avoid spurious 
failures due to JDK-8205399
 Key: LUCENE-8991
 URL: https://issues.apache.org/jira/browse/LUCENE-8991
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Chris M. Hostetter



An incredibly common class of jenkins failure (at least in Solr tests) stems 
from triggering assertion failures in java.util.HashMap -- evidently triggering 
bug JDK-8205399, first introduced in java-10, and fixed in java-12, but has 
never been backported to any java-10 or java-11 bug fix release...

   https://bugs.openjdk.java.net/browse/JDK-8205399

SOLR-13653 tracks how this bug can affect Solr users, but I think it would make 
sense to disable java.util.HashMap assertions in our build system to reduce the 
confusing failures when users/jenkins run tests, since there is nothing we can 
do to work around this when testing with java-11 (or java-10 on branch_8x)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13661) A package management system for Solr

2019-09-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938830#comment-16938830
 ] 

Jan Høydahl commented on SOLR-13661:


I have just looked at some of the code and will not have time for a more 
thorough review until the week after next.

Here is a list of my main concerns so far:
 # My main concern is that too many decisions seem to be made with too few 
eyes, combined with a goal of merging very soon.
 # One example of "too few eyes" is that the "package" concept seems to be 
designed for ONE use case only, customer's internal custom packages, with 
arbitrary local naming of repos and packages. I think before such a feature 
goes mainstream, the design should also include converting some of our contrib 
modules to packages that we release as separate binaries in the mirrors, and 
enable an "apache" Repo as default. That requires some more thought behind 
stable name-spacing, so that e.g. “bin/solr install ltr” will mean the same for 
all customers. Perhaps that would mean some name spacing or name collision 
resolution, so if you have a custom local repo with a package also called 
"ltr", then you get an error which can be resolved by qualifying the package 
name like e.g. "apache:ltr" or "mylocalrepo:ltr".
 # We need a plan for how 3rd party plugin developers can publish their plugins 
on their own web site or on GitHub in a well defined way. The use of 
pf4j-update lib takes care of much of this, and this is also something that can 
be added incrementally, but the design needs to allow for this. My POC has a 
RepositoryFactory class that parses the repo URL (e.g. "bin/solr plugin repo 
add myrepo https://host.com/repo/name") and selects the 
GitHubUpdateRepository if it is a GitHub URL, the ApacheMirrorsUpdateRepository 
if it is an apache.org address, and otherwise the default site/FS repo. Each of 
these handles the download process and signature verification in a different way.
 # Hot/Cold deploy. I don't like systems where you, as part of the install, 
need to spin up a server. We already have this with setting urlScheme in ZK for 
HTTPS. But ideally it should be possible to do a Solr install including plugins 
before you need to spin up Solr. Elasticsearch uses such a static plugin 
installer (but also doesn't support hot install). Having a "staging" folder 
where you can drop package ZIP files (or JARs), from which the node can 
self-install packages during first startup, could be one way to handle this.
 # Robustness during upgrades is another concern. I don't see the design doc 
mention what happens during a Solr upgrade. We should think through the 
scenario for both minor and major version upgrades of Solr, and by that I mean 
rolling upgrades. Having ZK as the only master for what version of a plugin 
should be used is probably not sufficient, as during a rolling Solr upgrade, 
you could have one node on 8.3 and another node on, say, 9.0. And you could 
have packageA:1.0 installed but Solr9 requires v2.0 due to removal of some 
APIs or whatnot. In the cold scenario (as in the POC) you'd shut down a Solr 
node, upgrade Solr, then run "bin/solr plugin upgrade outdated" before 
starting that node again, and that would make sure it has the correct plugin 
version. Since you cannot upgrade Solr while it is running, perhaps we need to 
hook in some validation on node startup that it does not have any packages 
that won't work with that Solr version, and refuse to start. And we need some 
way to have two versions of a package installed at the same time, so that 
instead of using the latest, the Solr node will select the newest version that 
is compatible. Then when that node is upgraded it will select the new version 
of the plugin automatically based on Version.java.
 # The package system deserves its own znode in ZooKeeper instead of abusing 
clusterProps.
 # I don't like the concept of an admin needing to "deploy" a package to a 
collection using a command. Rather, the collections should require a set of 
packages (optionally with a minimum version) and fail to start if one is not 
available in the system. If the package is available in the system, the 
collection should gain access to the package(s) it requires without running a 
deploy command.
 # Simplicity should be in the front seat. Don't force users to have to add 
{{package="my-pkg"}} wherever they today can say 
{{class="com.example.MyPlugin"}}. This is what we have ResourceLoader and class 
loaders for. If we cannot find {{com.example.MyPlugin}} in the main class 
loader, then hunt through every package class loader until there is a match; 
if there is no match, throw ClassNotFound (see the sketch after this list). (I 
never liked the {{runtimeLib=true}} equivalent in the old blob store.)
 # The package design says that a manifest is not required for a package and 
that any plain jar can function as a package just by registering it manually. 
That is ok as an alternative workflow, but most packages (and all 
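
To make point 8 concrete, here is a minimal sketch of that class-loader 
fallback (all names hypothetical, assuming a main loader plus one loader per 
installed package):

{code:java}
import java.util.List;

public final class PluginClassResolver {
  // Try the main class loader first, then each package's class loader,
  // and only give up once every loader has been consulted.
  public static Class<?> resolve(String name, ClassLoader main,
                                 List<ClassLoader> packageLoaders)
      throws ClassNotFoundException {
    try {
      return Class.forName(name, false, main);
    } catch (ClassNotFoundException ignored) {
      // fall through to the per-package loaders
    }
    for (ClassLoader loader : packageLoaders) {
      try {
        return Class.forName(name, false, loader);
      } catch (ClassNotFoundException ignored) {
        // try the next package
      }
    }
    throw new ClassNotFoundException(name);
  }
}
{code}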

[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-09-26 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938818#comment-16938818
 ] 

Mark Miller commented on SOLR-13796:


Some other items:
 * Moving everything off the old slow base SolrCloud tests.
 * Removing old tests that are either silly now and/or of little to zero value 
for their cost.
 * General clean-up and sensible cobweb clearing.

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build up on it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that have our tests 
> often take dozens of seconds to minutes when they should take mere seconds or 
> 10.
> As part of this issue, I would like to move the focus of non nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into nightly tests. Not 
> something that happens now, but if we have solid, fast, consistent non 
> Nightly tests, that should open up some room for Nightly to get some status 
> boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document

2019-09-26 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938764#comment-16938764
 ] 

Ignacio Vera commented on LUCENE-8990:
--

Thanks Atri!

> IndexOrDocValuesQuery can take a bad decision for range queries if field has 
> many values per document
> -
>
> Key: LUCENE-8990
> URL: https://issues.apache.org/jira/browse/LUCENE-8990
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
>
> Heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range 
> queries. The leadCost that is provided is based on the number of documents, 
> while the cost() of a range query is based on the number of points that 
> potentially match the query. 
> Therefore it might happen that a BKD tree has millions of points but these 
> points correspond to just a few documents. We can then take the decision of 
> executing the query using docValues when in fact we are almost scanning all 
> the points.
> Maybe the cost() function for range queries needs to take into account the 
> average number of points per document in the tree and adjust the value 
> accordingly.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document

2019-09-26 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938748#comment-16938748
 ] 

Ignacio Vera edited comment on LUCENE-8990 at 9/26/19 4:00 PM:
---

I was thinking of something more like:

{code:java}
// cast so we compute the average number of points per document, not an integer division
double pointsPerDoc = (double) values.size() / values.getDocCount();
long estimatedDocCount = (long) (values.estimatePointCount(visitor) / pointsPerDoc);{code}
 

 
Maybe that can be abstracted out as a new method in PointValues like 
{{estimateDocCount()}}.
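
A rough sketch of what that could look like (just my assumption of the shape, 
not an actual Lucene API), clamping to getDocCount() so we never report more 
docs than the segment has:

{code:java}
import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.PointValues.IntersectVisitor;

public final class DocCountEstimator {
  // Scale the estimated point count down by the average number of
  // points per document for this field.
  public static long estimateDocCount(PointValues values, IntersectVisitor visitor) {
    double pointsPerDoc = (double) values.size() / values.getDocCount();
    long estimatedPoints = values.estimatePointCount(visitor);
    return Math.min(values.getDocCount(), (long) (estimatedPoints / pointsPerDoc));
  }
}
{code}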


was (Author: ivera):
I was thinking of something more like:

{code:java}
// cast so we compute the average number of points per document, not an integer division
double pointsPerDoc = (double) values.size() / values.getDocCount();
long estimatedDocCount = (long) (values.estimatePointCount(visitor) / pointsPerDoc);{code}
 

 

Maybe that can be abstracted out as a new method in the IntersectVisitor like 
{{estimateDocCount()}}.

> IndexOrDocValuesQuery can take a bad decision for range queries if field has 
> many values per document
> -
>
> Key: LUCENE-8990
> URL: https://issues.apache.org/jira/browse/LUCENE-8990
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
>
> Heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range 
> queries. The leadCost that is provided is based on the number of documents, 
> while the cost() of a range query is based on the number of points that 
> potentially match the query. 
> Therefore it might happen that a BKD tree has millions of points but these 
> points correspond to just a few documents. We can then take the decision of 
> executing the query using docValues when in fact we are almost scanning all 
> the points.
> Maybe the cost() function for range queries needs to take into account the 
> average number of points per document in the tree and adjust the value 
> accordingly.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document

2019-09-26 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938750#comment-16938750
 ] 

Atri Sharma commented on LUCENE-8990:
-

+1

 

I am happy to take a crack at this if you are not planning to do so.

> IndexOrDocValuesQuery can take a bad decision for range queries if field has 
> many values per document
> -
>
> Key: LUCENE-8990
> URL: https://issues.apache.org/jira/browse/LUCENE-8990
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
>
> Heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range 
> queries. The leadCost that is provided is based on the number of documents, 
> while the cost() of a range query is based on the number of points that 
> potentially match the query. 
> Therefore it might happen that a BKD tree has millions of points but these 
> points correspond to just a few documents. We can then take the decision of 
> executing the query using docValues when in fact we are almost scanning all 
> the points.
> Maybe the cost() function for range queries needs to take into account the 
> average number of points per document in the tree and adjust the value 
> accordingly.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document

2019-09-26 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938748#comment-16938748
 ] 

Ignacio Vera commented on LUCENE-8990:
--

I was thinking of something more like:

{code:java}
// cast so we compute the average number of points per document, not an integer division
double pointsPerDoc = (double) values.size() / values.getDocCount();
long estimatedDocCount = (long) (values.estimatePointCount(visitor) / pointsPerDoc);{code}
 

 

Maybe that can be abstracted out as a new method in the IntersectVisitor like 
{{estimateDocCount()}}.

> IndexOrDocValuesQuery can take a bad decision for range queries if field has 
> many values per document
> -
>
> Key: LUCENE-8990
> URL: https://issues.apache.org/jira/browse/LUCENE-8990
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
>
> Heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range 
> queries . The leadCost that is provided is based on number of documents, 
> meanwhile the cost() of a range query is based on the number of points that 
> potentially match the query. 
> Therefore it might happen that a BKD tree has millions of points but this 
> points correspond to just a few documents. Therefore we can take the decision 
> of executing the query using docValues and in fact we are almost scanning all 
> the points.
> Maybe the cost() function for range queries need to take into account the 
> average number of points per document in the tree and adjust the value 
> accordingly.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-09-26 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938742#comment-16938742
 ] 

Mark Miller commented on SOLR-13796:


I would also like to start enforcing limits on non-Nightly tests - runtimes 
that exceed a certain fairly low threshold will start failing tests and 
suggesting alternatives or Nightly. Close times of critical components will 
also be instrumented, tracked, and limited to reasonable times.

There is also a new, consistent, fast way to close objects safely that does 
this close tracking and has also sped up the object lifecycle quite a bit, even 
where we already tried to do things fast.

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build on it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that have our tests 
> often take dozens of seconds to minutes when they should take mere seconds 
> or ten.
> As part of this issue, I would like to move the focus of non-Nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into Nightly tests. Not 
> something that happens now, but if we have solid, fast, consistent 
> non-Nightly tests, that should open up some room for Nightly to get some 
> status boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-13796) Fix Solr Test Performance

2019-09-26 Thread Mark Miller (Jira)
Mark Miller created SOLR-13796:
--

 Summary: Fix Solr Test Performance
 Key: SOLR-13796
 URL: https://issues.apache.org/jira/browse/SOLR-13796
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Mark Miller
Assignee: Mark Miller


I had kind of forgotten, but while working on Starburst I had realized that 
almost all of our tests are capable of being very fast and logging 10x less as 
a result. When they get this fast, a lot of infrequent random fails become 
frequent and things become much easier to debug. I had fixed a lot of issues to 
make tests pretty damn fast in the starburst branch, but tons of tests were 
still ignored due to the scope of changes going on.

A variety of things have converged that have allowed me to absorb most of that 
work and build on it while also almost finishing it.

This will be another huge PR aimed at addressing issues that have our tests 
often take dozens of seconds to minutes when they should take mere seconds or 
ten.

As part of this issue, I would like to move the focus of non-Nightly tests 
towards being more minimal, consistent and fast.

In exchange, we must put more effort and care into Nightly tests. Not something 
that happens now, but if we have solid, fast, consistent non-Nightly tests, 
that should open up some room for Nightly to get some status boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13722) Package Management APIs

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938726#comment-16938726
 ] 

Ishan Chattopadhyaya commented on SOLR-13722:
-

[~noble.paul], I've updated the titles of this and SOLR-13710. Can you please 
update their descriptions to relate closely to the design document? This 
issue's description provides absolutely no clue as to what this issue is about.

> Package Management APIs
> ---
>
> Key: SOLR-13722
> URL: https://issues.apache.org/jira/browse/SOLR-13722
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Labels: package
>
> This ticket totally eliminates the need for an external service to host the 
> jars, so a URL will no longer be required. An external URL leads to 
> unreliability because the service may go offline or it can be DDoSed if/when 
> too many requests are sent to it.
>  
>  
>  Add a jar to the cluster as follows
> {code:java}
> curl -X POST -H 'Content-Type: application/octet-stream' --data-binary 
> @myjar.jar http://localhost:8983/api/cluster/blob
> {code}
> This does the following operations
>  * Upload this jar to all the live nodes in the system
>  * The name of the file is the {{sha256}} of the file/payload
>  * The blob is agnostic of the content of the file/payload
> h2. How it works
> A blob that is POSTed to the {{/api/cluster/blob}} endpoint is persisted 
> locally & all nodes are instructed to download it from this node or from any 
> other available node. If a node comes up later, it can query other nodes in 
> the system and download the blobs as required.
> h2. {{add-package}} command
> {code:java}
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>   "add-package": {
>    "name": "my-package" ,
>   "sha256":""
>   }}' http://localhost:8983/api/cluster
> {code}
>  The {{sha256}} is the same as the file name. It gets hold of the jar using 
> the following steps
>  * check the local file system for the blob
>  * if not available locally, query other live nodes (one by one) to see if 
> they have the blob
>  * if a node has it, it's downloaded and persisted to its local {{blob}} dir
> h2. Security
> The blob upload does not check the content of the payload and it does not 
> verify the file. However, the {{add-package}} and {{update-package}} commands 
> check the signatures (if enabled). 
>  The size of the file is limited to 5 MB to avoid OOM. This can be changed 
> using the system property {{runtime.lib.size}}. 
>  
>  
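For illustration, a minimal sketch of the lookup order described above (local blob dir first, then live nodes one by one); the class and the {{fetchFromNode}} helper are hypothetical stand-ins, not actual Solr code:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of the blob lookup order described above.
class BlobResolver {
  private final Path blobDir; // e.g. SOLR_HOME/blobs

  BlobResolver(Path blobDir) {
    this.blobDir = blobDir;
  }

  Path resolve(String sha256, List<String> liveNodes) throws IOException {
    Path local = blobDir.resolve(sha256); // file name == sha256 of the payload
    if (Files.exists(local)) {
      return local; // blob already present on this node
    }
    for (String node : liveNodes) { // query other live nodes one by one
      byte[] payload = fetchFromNode(node, sha256);
      if (payload != null) {
        Files.write(local, payload); // persist into the local blob dir
        return local;
      }
    }
    throw new IOException("blob " + sha256 + " not found on any live node");
  }

  private byte[] fetchFromNode(String node, String sha256) {
    // Placeholder: a real implementation would issue an HTTP GET against the
    // other node's blob endpoint and verify the sha256 of the response.
    return null;
  }
}
{code}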



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13710) FSBlobStore: a new blob store

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-13710:

Description: 
* All jars downloaded for packages are stored in the dir SOLR_HOME/blobs. 
* The file names will be the sha256 hash of the files.
* Before downloading a jar from a location, the local directory is checked 
first
* POST a jar to http://localhost:8983/api/cluster/blob to distribute it in the 
cluster
* A new API endpoint {{http://localhost:8983/api/node/blob}} will list the 
available jars

example
{code:json}
{
"blob":["e1f9e23988c19619402f1040c9251556dcd6e02b9d3e3b966a129ea1be5c70fc",
"79298d7d5c3e60d91154efe7d72f4536eac46698edfa22ab894b85492d562ed4"]
}
{code}
* The jar will be downloadable at 
{{http://localhost:8983/api/node/blob/}} 


Design: 
https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?ts=5d86a8ad#heading=h.qxgax9a5br5o

  was:
* All jars downloaded for packages are stored in the dir SOLR_HOME/blobs. 
* The file names will be the sha256 hash of the files.
* Before downloading a jar from a location, the local directory is checked 
first
* POST a jar to http://localhost:8983/api/cluster/blob to distribute it in the 
cluster
* A new API endpoint {{http://localhost:8983/api/node/blob}} will list the 
available jars

example
{code:json}
{
"blob":["e1f9e23988c19619402f1040c9251556dcd6e02b9d3e3b966a129ea1be5c70fc",
"79298d7d5c3e60d91154efe7d72f4536eac46698edfa22ab894b85492d562ed4"]
}
{code}
* The jar will be downloadable at 
{{http://localhost:8983/api/node/blob/}} 





> FSBlobStore: a new blob store
> -
>
> Key: SOLR-13710
> URL: https://issues.apache.org/jira/browse/SOLR-13710
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> * All jars downloaded for packages are stored in the dir SOLR_HOME/blobs. 
> * The file names will be the sha256 hash of the files.
> * Before downloading a jar from a location, the local directory is checked 
> first
> * POST a jar to http://localhost:8983/api/cluster/blob to distribute it in 
> the cluster
> * A new API endpoint {{http://localhost:8983/api/node/blob}} will list the 
> available jars
> example
> {code:json}
> {
> "blob":["e1f9e23988c19619402f1040c9251556dcd6e02b9d3e3b966a129ea1be5c70fc",
> "79298d7d5c3e60d91154efe7d72f4536eac46698edfa22ab894b85492d562ed4"]
> }
> {code}
> * The jar will be downloadable at 
> {{http://localhost:8983/api/node/blob/}} 
> Design: 
> https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?ts=5d86a8ad#heading=h.qxgax9a5br5o
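As a reference point, computing the sha256-based file name described above needs only the JDK; this is an illustrative snippet, not FSBlobStore code:
{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class BlobName {
  // Returns the hex sha256 of a file, which doubles as the blob file name.
  public static String sha256Name(Path jar) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    try (InputStream in = Files.newInputStream(jar)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString(); // e.g. "e1f9e239..." as in the listing above
  }

  public static void main(String[] args) throws Exception {
    System.out.println(sha256Name(Paths.get("myjar.jar")));
  }
}
{code}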



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13710) FSBlobStore: a new blob store

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-13710:

Summary: FSBlobStore: a new blob store  (was: Persist package jars locally 
& expose them over http)

> FSBlobStore: a new blob store
> -
>
> Key: SOLR-13710
> URL: https://issues.apache.org/jira/browse/SOLR-13710
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> * All jars downloaded for packages are stored in the dir SOLR_HOME/blobs. 
> * The file names will be the sha256 hash of the files.
> * Before downloading a jar from a location, the local directory is checked 
> first
> * POST a jar to http://localhost:8983/api/cluster/blob to distribute it in 
> the cluster
> * A new API endpoint {{http://localhost:8983/api/node/blob}} will list the 
> available jars
> example
> {code:json}
> {
> "blob":["e1f9e23988c19619402f1040c9251556dcd6e02b9d3e3b966a129ea1be5c70fc",
> "79298d7d5c3e60d91154efe7d72f4536eac46698edfa22ab894b85492d562ed4"]
> }
> {code}
> * The jar will be downloadable at 
> {{http://localhost:8983/api/node/blob/}} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13722) A cluster-wide blob upload package option & avoid remote url

2019-09-26 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938717#comment-16938717
 ] 

Ishan Chattopadhyaya commented on SOLR-13722:
-

[~dsmiley], based on the description, this issue seems to be about providing 
APIs to load/unload/update jars from the blob store into the classpath. 
Corresponds to the "Package Management APIs", I think. Can you confirm, Noble?

Also, Noble, I think the sub-task JIRAs should align closely with the design 
document sections.

> A cluster-wide blob upload package option & avoid remote url
> 
>
> Key: SOLR-13722
> URL: https://issues.apache.org/jira/browse/SOLR-13722
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Labels: package
>
> This ticket totally eliminates the need for an external service to host the 
> jars, so a URL will no longer be required. An external URL leads to 
> unreliability because the service may go offline or it can be DDoSed if/when 
> too many requests are sent to it.
>  
>  
>  Add a jar to the cluster as follows
> {code:java}
> curl -X POST -H 'Content-Type: application/octet-stream' --data-binary 
> @myjar.jar http://localhost:8983/api/cluster/blob
> {code}
> This does the following operations
>  * Upload this jar to all the live nodes in the system
>  * The name of the file is the {{sha256}} of the file/payload
>  * The blob is agnostic of the content of the file/payload
> h2. How it works
> A blob that is POSTed to the {{/api/cluster/blob}} endpoint is persisted 
> locally & all nodes are instructed to download it from this node or from any 
> other available node. If a node comes up later, it can query other nodes in 
> the system and download the blobs as required.
> h2. {{add-package}} command
> {code:java}
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>   "add-package": {
>    "name": "my-package" ,
>   "sha256":""
>   }}' http://localhost:8983/api/cluster
> {code}
>  The {{sha256}} is the same as the file name. It gets hold of the jar using 
> the following steps
>  * check the local file system for the blob
>  * if not available locally, query other live nodes (one by one) to see if 
> they have the blob
>  * if a node has it, it's downloaded and persisted to its local {{blob}} dir
> h2. Security
> The blob upload does not check the content of the payload and it does not 
> verify the file. However, the {{add-package}} and {{update-package}} commands 
> check the signatures (if enabled). 
>  The size of the file is limited to 5 MB to avoid OOM. This can be changed 
> using the system property {{runtime.lib.size}}. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13722) A cluster-wide blob upload package option & avoid remote url

2019-09-26 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938711#comment-16938711
 ] 

David Wayne Smiley commented on SOLR-13722:
---

There are two sub-tasks that, based on the title alone, seem like the same 
thing: this one and SOLR-13710. So it's not clear where to discuss a new/second 
"blob store". Can you help disambiguate these for me, [~noble.paul]?

> A cluster-wide blob upload package option & avoid remote url
> 
>
> Key: SOLR-13722
> URL: https://issues.apache.org/jira/browse/SOLR-13722
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Labels: package
>
> This ticket totally eliminates the need for an external service to host the 
> jars, so a URL will no longer be required. An external URL leads to 
> unreliability because the service may go offline or it can be DDoSed if/when 
> too many requests are sent to it.
>  
>  
>  Add a jar to the cluster as follows
> {code:java}
> curl -X POST -H 'Content-Type: application/octet-stream' --data-binary 
> @myjar.jar http://localhost:8983/api/cluster/blob
> {code}
> This does the following operations
>  * Upload this jar to all the live nodes in the system
>  * The name of the file is the {{sha256}} of the file/payload
>  * The blob is agnostic of the content of the file/payload
> h2. How it works
> A blob that is POSTed to the {{/api/cluster/blob}} endpoint is persisted 
> locally & all nodes are instructed to download it from this node or from any 
> other available node. If a node comes up later, it can query other nodes in 
> the system and download the blobs as required.
> h2. {{add-package}} command
> {code:java}
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>   "add-package": {
>    "name": "my-package" ,
>   "sha256":""
>   }}' http://localhost:8983/api/cluster
> {code}
>  The {{sha256}} is the same as the file name. It gets hold of the jar using 
> the following steps
>  * check the local file system for the blob
>  * if not available locally, query other live nodes (one by one) to see if 
> they have the blob
>  * if a node has it, it's downloaded and persisted to its local {{blob}} dir
> h2. Security
> The blob upload does not check the content of the payload and it does not 
> verify the file. However, the {{add-package}} and {{update-package}} commands 
> check the signatures (if enabled). 
>  The size of the file is limited to 5 MB to avoid OOM. This can be changed 
> using the system property {{runtime.lib.size}}. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document

2019-09-26 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938700#comment-16938700
 ] 

Atri Sharma commented on LUCENE-8990:
-

+1, I think that is a good heuristic – strangely enough, I was thinking of this 
limitation for a similar problem.

 

Would it suffice if we just made PointRangeQuery also consider the BKDReader's 
docCount, in addition to pointCount? e.g. (cost = values.estimatePointCount() / 
values.estimateDocCount())?

> IndexOrDocValuesQuery can take a bad decision for range queries if field has 
> many values per document
> -
>
> Key: LUCENE-8990
> URL: https://issues.apache.org/jira/browse/LUCENE-8990
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
>
> The heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range 
> queries. The leadCost that is provided is based on the number of documents, 
> while the cost() of a range query is based on the number of points that 
> potentially match the query. 
> Therefore it might happen that a BKD tree has millions of points but these 
> points correspond to just a few documents. We can then take the decision 
> of executing the query using docValues when in fact we are almost scanning 
> all the points.
> Maybe the cost() function for range queries needs to take into account the 
> average number of points per document in the tree and adjust the value 
> accordingly.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13791) Remove BeanUtils reference from ivy-versions.properties

2019-09-26 Thread Andras Salamon (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938694#comment-16938694
 ] 

Andras Salamon commented on SOLR-13791:
---

There were more references to beanutils; I uploaded a new patch.

> Remove BeanUtils reference from ivy-versions.properties
> ---
>
> Key: SOLR-13791
> URL: https://issues.apache.org/jira/browse/SOLR-13791
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andras Salamon
>Priority: Major
> Attachments: SOLR-13791-01.patch, SOLR-13791-02.patch
>
>
> SOLR-12617 removed Commons BeanUtils, but {{lucene/ivy-versions.properties}} 
> still has a reference to beanutils, because SOLR-9515 added this line back.
> We can remove this line.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13791) Remove BeanUtils reference from ivy-versions.properties

2019-09-26 Thread Andras Salamon (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Salamon updated SOLR-13791:
--
Attachment: SOLR-13791-02.patch

> Remove BeanUtils reference from ivy-versions.properties
> ---
>
> Key: SOLR-13791
> URL: https://issues.apache.org/jira/browse/SOLR-13791
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andras Salamon
>Priority: Major
> Attachments: SOLR-13791-01.patch, SOLR-13791-02.patch
>
>
> SOLR-12617 removed Commons BeanUtils, but {{lucene/ivy-versions.properties}} 
> still has a reference to beanutils, because SOLR-9515 added this line back.
> We can remove this line.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] chatman commented on issue #898: SOLR-13661: A package management system for Solr

2019-09-26 Thread GitBox
chatman commented on issue #898: SOLR-13661: A package management system for 
Solr
URL: https://github.com/apache/lucene-solr/pull/898#issuecomment-535539988
 
 
   * There are merge conflicts.
   * The branch is SOLR-13722, but the title is SOLR-13661. Is this raised from 
the right branch?
   * Seems like blob store changes are in this PR, which is package-manager 
related; shouldn't they be separated into different issues?
   * The commit messages seem very temporary; we should squash-merge this 
branch (if we decide to do so).
   * One of the commits mentions "repossitory CRUD", but that shouldn't be 
here. Maybe it was removed in a subsequent commit, but the commit messages 
don't indicate so.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13764) Parse Interval Query from JSON API

2019-09-26 Thread Mikhail Khludnev (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938688#comment-16938688
 ] 

Mikhail Khludnev commented on SOLR-13764:
-

Refer to 
https://cwiki.apache.org/confluence/display/SOLR/SOLR-13764+Discussion+-+Interval+Queries+in+JSON
 for the syntax proposal. 

> Parse Interval Query from JSON API
> --
>
> Key: SOLR-13764
> URL: https://issues.apache.org/jira/browse/SOLR-13764
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Reporter: Mikhail Khludnev
>Priority: Major
>
> h2. Context
> Lucene has the Intervals query, LUCENE-8196. Note: these are a kind of healthy 
> man's Spans/Phrases. Note: it's not about ranges or facets.
> h2. Problem
> There's no way to search by IntervalQuery via the JSON Query DSL.
> h2. Suggestion
>  * Create a classic QParser \{{ {!interval df=text_content}a_json_param}}, i.e. 
> one can combine a few such refs in {{json.query.bool}}
>  * It accepts just the name of a JSON param; nothing like this happens yet.
>  * This param carries plain JSON which is accessible via {{req.getJSON()}}
> {code:json}
> {
>   query: {bool: {should: [
>     {interval: i_1},
>     {interval: {query: i_2, df: title}}
>   ]}},
>   params: {
>     df: description_t,
>     i_1: {phrase: "lorem ipsum"},
>     i_2: {unordered: [{term: "bar"}, {phrase: "bag ban"}]}
>   }
> }
> {code}
> h2. Challenges
>  * I have no idea about the particular JSON DSL for these queries; the Lucene 
> API seems easily JSON-able. Proposals are welcome.
>  * Another awkward thing is combining analysis and the low-level query API, 
> e.g. what if one requests a term for one word and analysis yields two tokens; 
> and vice versa, requesting a phrase might end up with a single token stream.
>  * Putting JSON into a Jira ticket description
> h2. Q: Why don't..
> .. put the intervals DSL right into {{json.query}}, avoiding these odd param 
> refs? 
>  A: It requires heavy lifting for {{JsonQueryConverter}}, which is streamlined 
> for handling good old HTTP parametrized queries.
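For a sense of how the params above could map onto the Lucene API, a hedged sketch (package location as in recent 8.x releases; the mapping itself is illustrative, not a committed design):
{code:java}
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// i_1: {phrase:"lorem ipsum"} against the default field description_t
Query i1 = new IntervalQuery("description_t", Intervals.phrase("lorem", "ipsum"));
// i_2: {unordered:[{term:"bar"},{phrase:"bag ban"}]} against title
Query i2 = new IntervalQuery("title",
    Intervals.unordered(Intervals.term("bar"), Intervals.phrase("bag", "ban")));
// json.query.bool with two should clauses
Query bool = new BooleanQuery.Builder()
    .add(i1, BooleanClause.Occur.SHOULD)
    .add(i2, BooleanClause.Occur.SHOULD)
    .build();
{code}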



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] thomaswoeckinger edited a comment on issue #902: SOLR-13795: Reload solr core after schema is persisted.

2019-09-26 Thread GitBox
thomaswoeckinger edited a comment on issue #902: SOLR-13795: Reload solr core 
after schema is persisted.
URL: https://github.com/apache/lucene-solr/pull/902#issuecomment-535515216
 
 
@dsmiley: Do you have time to review?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] thomaswoeckinger commented on issue #902: SOLR-13795: Reload solr core after schema is persisted.

2019-09-26 Thread GitBox
thomaswoeckinger commented on issue #902: SOLR-13795: Reload solr core after 
schema is persisted.
URL: https://github.com/apache/lucene-solr/pull/902#issuecomment-535515216
 
 
@dsmiley: Do you have time to review!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13795) SolrIndexSearcher still uses old schema after schema update using schema-api

2019-09-26 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SOLR-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wöckinger updated SOLR-13795:

Labels: easyfix pull-request-available  (was: easyfix)

> SolrIndexSearcher still uses old schema after schema update using schema-api
> 
>
> Key: SOLR-13795
> URL: https://issues.apache.org/jira/browse/SOLR-13795
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: config-api, Schema and Analysis, Server, SolrJ, v2 API
>Affects Versions: 7.7.2, master (9.0), 8.2
>Reporter: Thomas Wöckinger
>Priority: Critical
>  Labels: easyfix, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When adding a new field to the schema using the schema-api, the new field is 
> not known by the current SolrIndexSearcher. In SolrCloud any core gets 
> reloaded after the new schema is persisted; this does not happen in the case 
> of a standalone HTTP Solr server or EmbeddedSolrServer.
> So currently an additional commit is necessary to open a new 
> SolrIndexSearcher using the new schema.
> The fix is really easy: just reload the core!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] thomaswoeckinger opened a new pull request #902: SOLR-13795: Reload solr core after schema is persisted.

2019-09-26 Thread GitBox
thomaswoeckinger opened a new pull request #902: SOLR-13795: Reload solr core 
after schema is persisted.
URL: https://github.com/apache/lucene-solr/pull/902
 
 
   
   
   
   # Description
   
   Please provide a short description of the changes you're making with this 
pull request.
   
   # Solution
   
   Please provide a short description of the approach taken to implement your 
solution.
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch 
implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I am authorized to contribute this code to the ASF and have removed 
any code I do not have a license to distribute.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-13795) SolrIndexSearcher still uses old schema after schema update using schema-api

2019-09-26 Thread Jira
Thomas Wöckinger created SOLR-13795:
---

 Summary: SolrIndexSearcher still uses old schema after schema 
update using schema-api
 Key: SOLR-13795
 URL: https://issues.apache.org/jira/browse/SOLR-13795
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: config-api, Schema and Analysis, Server, SolrJ, v2 API
Affects Versions: 8.2, 7.7.2, master (9.0)
Reporter: Thomas Wöckinger


When adding a new field to the schema using the schema-api, the new field is not 
known by the current SolrIndexSearcher. In SolrCloud any core gets reloaded 
after the new schema is persisted; this does not happen in the case of a 
standalone HTTP Solr server or EmbeddedSolrServer.

So currently an additional commit is necessary to open a new SolrIndexSearcher 
using the new schema.

The fix is really easy: just reload the core!
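A minimal sketch of the proposed fix, assuming access to the CoreContainer and the core at the point where the managed schema is persisted:
{code:java}
// After persisting the managed schema in non-cloud mode, reload the core so
// that a new SolrIndexSearcher is opened against the updated schema. How the
// CoreContainer and core are obtained here is assumed context, not a fixed
// API path.
coreContainer.reload(core.getName());
{code}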



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Closed] (SOLR-13788) Resolve multiple IPs from specified zookeeper URL

2019-09-26 Thread Ween Jiann (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ween Jiann closed SOLR-13788.
-

> Resolve multiple IPs from specified zookeeper URL
> -
>
> Key: SOLR-13788
> URL: https://issues.apache.org/jira/browse/SOLR-13788
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.1.1
>Reporter: Ween Jiann
>Priority: Minor
>  Labels: features
>
> Use a DNS lookup to get the IPs of the servers listed in ZK_HOST or the -z 
> param. This would help cloud deployments, as DNS is often used to group 
> services together.
> [https://lucene.apache.org/solr/guide/8_1/setting-up-an-external-zookeeper-ensemble.html]
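For what it's worth, resolving every address behind a single hostname needs only the JDK; a sketch with a hypothetical hostname:
{code:java}
import java.net.InetAddress;

public class ZkDns {
  public static void main(String[] args) throws Exception {
    // zookeeper.example.com is a placeholder for a DNS name that groups the
    // ensemble; each resolved A record would become one ZK_HOST entry.
    for (InetAddress addr : InetAddress.getAllByName("zookeeper.example.com")) {
      System.out.println(addr.getHostAddress() + ":2181");
    }
  }
}
{code}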



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13788) Resolve multiple IPs from specified zookeeper URL

2019-09-26 Thread Ween Jiann (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ween Jiann resolved SOLR-13788.
---
Resolution: Not A Problem

> Resolve multiple IPs from specified zookeeper URL
> -
>
> Key: SOLR-13788
> URL: https://issues.apache.org/jira/browse/SOLR-13788
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.1.1
>Reporter: Ween Jiann
>Priority: Minor
>  Labels: features
>
> Use a DNS lookup to get the IPs of the servers listed in ZK_HOST or the -z 
> param. This would help cloud deployments, as DNS is often used to group 
> services together.
> [https://lucene.apache.org/solr/guide/8_1/setting-up-an-external-zookeeper-ensemble.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-8990) IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document

2019-09-26 Thread Ignacio Vera (Jira)
Ignacio Vera created LUCENE-8990:


 Summary: IndexOrDocValuesQuery can take a bad decision for range 
queries if field has many values per document
 Key: LUCENE-8990
 URL: https://issues.apache.org/jira/browse/LUCENE-8990
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ignacio Vera


The heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range 
queries. The leadCost that is provided is based on the number of documents, 
while the cost() of a range query is based on the number of points that 
potentially match the query. 

Therefore it might happen that a BKD tree has millions of points but these 
points correspond to just a few documents. We can then take the decision of 
executing the query using docValues when in fact we are almost scanning all 
the points.

Maybe the cost() function for range queries needs to take into account the 
average number of points per document in the tree and adjust the value 
accordingly.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938496#comment-16938496
 ] 

Guoqiang Jiang commented on LUCENE-8980:


Hi [~dsmiley], thanks for your suggestion. I have updated the description and 
comments.

Please help commit this improvement. Thanks again.

> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an indexed _id 
> field that uniquely identifies it. When Elasticsearch uses the _id field to 
> find a document in Lucene, Lucene has to check all the segments of the 
> index. When the values of the _id field are highly sequential, the 
> performance is optimizable.
>   
> *Solution*
> Since Lucene stores min/maxTerm metrics for each segment and field, we can 
> use those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in an index, we can 
> check whether the term falls in the range of minTerm and maxTerm, so that we 
> can skip some useless segments as soon as possible.
>   
>  This improvement is beneficial to the ES read/write APIs and the Lucene 
> lookup API.
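As a hedged sketch of the idea (not the committed patch): consult the segment's min/max term before doing the real dictionary walk. Terms#getMin()/getMax() are existing Lucene APIs; the wrapper method here is illustrative:
{code:java}
// Returns false when the target term cannot be in this segment, letting
// seekExact() bail out without touching the term dictionary/FST at all.
boolean mayContain(org.apache.lucene.index.Terms terms,
                   org.apache.lucene.util.BytesRef target) throws java.io.IOException {
  org.apache.lucene.util.BytesRef min = terms.getMin();
  org.apache.lucene.util.BytesRef max = terms.getMax();
  if (min != null && target.compareTo(min) < 0) return false; // below segment range
  if (max != null && target.compareTo(max) > 0) return false; // above segment range
  return true; // inconclusive: fall through to the normal seekExact walk
}
{code}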



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Jiang updated LUCENE-8980:
---
Description: 
*Description*

In Elasticsearch, which is based on Lucene, each document has an indexed _id 
field that uniquely identifies it. When Elasticsearch uses the _id field to find 
a document in Lucene, Lucene has to check all the segments of the index. When 
the values of the _id field are highly sequential, the performance is 
optimizable.
  

*Solution*

Since Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in an index, we can check 
whether the term falls in the range of minTerm and maxTerm, so that we can skip 
some useless segments as soon as possible.
  
 This improvement is beneficial to the ES read/write APIs and the Lucene lookup 
API.

  was:
*Description*

In Elasticsearch, which is based on Lucene, each document has an indexed _id 
field that uniquely identifies it. When Elasticsearch uses the _id field to find 
a document in Lucene, Lucene has to check all the segments of the index. When 
the values of the _id field are highly sequential, the performance is 
optimizable.
 

*Solution*

Since Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in an index, we can check 
whether the term falls in the range of minTerm and maxTerm, so that we can skip 
some useless segments as soon as possible.
 
This PR is beneficial to the ES read/write APIs and the Lucene lookup API.




> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an indexed _id 
> field that uniquely identifies it. When Elasticsearch uses the _id field to 
> find a document in Lucene, Lucene has to check all the segments of the 
> index. When the values of the _id field are highly sequential, the 
> performance is optimizable.
>   
> *Solution*
> Since Lucene stores min/maxTerm metrics for each segment and field, we can 
> use those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in an index, we can 
> check whether the term falls in the range of minTerm and maxTerm, so that we 
> can skip some useless segments as soon as possible.
>   
>  This improvement is beneficial to the ES read/write APIs and the Lucene 
> lookup API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938423#comment-16938423
 ] 

Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 10:50 AM:
--

We ran another test case, _wikimedium10m_, to verify the improvement on a large 
data set. The complete results are 
[here|https://gist.github.com/jgq2008303393/44768d69a843c7b421e765bbab9360fd.js].
 The following table is the result of the last run:

|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|OrHighNotLow|293.93|(5.8%)|286.46|(6.6%)|-2.5%(-14% - 10%)|
|OrHighNotHigh|258.18|(3.7%)|252.41|(5.0%)|-2.2%(-10% - 6%)|
|OrHighLow|206.52|(6.2%)|202.55|(6.2%)|-1.9%(-13% - 11%)|
|MedPhrase|16.41|(4.1%)|16.12|(2.6%)|-1.7%(-8% - 5%)|
|LowTerm|608.71|(5.7%)|599.21|(4.4%)|-1.6%(-10% - 9%)|
|Prefix3|37.96|(2.8%)|37.51|(3.8%)|-1.2%(-7% - 5%)|
|OrNotHighHigh|255.49|(5.5%)|252.63|(6.1%)|-1.1%(-12% - 11%)|
|MedSloppyPhrase|13.71|(3.5%)|13.58|(3.7%)|-1.0%(-7% - 6%)|
|HighSloppyPhrase|17.00|(3.3%)|16.84|(3.7%)|-0.9%(-7% - 6%)|
|OrHighHigh|19.02|(2.6%)|18.85|(2.7%)|-0.9%(-6% - 4%)|
|MedTerm|564.56|(4.6%)|559.38|(2.9%)|-0.9%(-8% - 6%)|
|OrNotHighLow|294.29|(4.9%)|291.86|(4.2%)|-0.8%(-9% - 8%)|
|AndHighLow|303.17|(3.7%)|300.72|(4.5%)|-0.8%(-8% - 7%)|
|AndHighHigh|28.24|(2.1%)|28.01|(2.7%)|-0.8%(-5% - 4%)|
|Wildcard|64.64|(3.9%)|64.21|(4.0%)|-0.7%(-8% - 7%)|
|HighSpanNear|15.14|(2.8%)|15.04|(2.5%)|-0.7%(-5% - 4%)|
|HighTerm|431.22|(3.9%)|428.68|(2.9%)|-0.6%(-7% - 6%)|
|LowSloppyPhrase|19.29|(2.2%)|19.18|(2.9%)|-0.6%(-5% - 4%)|
|LowSpanNear|64.32|(2.3%)|63.99|(2.0%)|-0.5%(-4% - 3%)|
|Fuzzy2|34.51|(12.8%)|34.34|(11.9%)|-0.5%(-22% - 27%)|
|MedSpanNear|51.51|(2.3%)|51.28|(1.6%)|-0.4%(-4% - 3%)|
|HighTermDayOfYearSort|51.45|(6.6%)|51.24|(7.5%)|-0.4%(-13% - 14%)|
|OrHighNotMed|306.95|(5.1%)|306.03|(3.2%)|-0.3%(-8% - 8%)|
|BrowseDateTaxoFacets|1.48|(0.6%)|1.47|(1.2%)|-0.2%(-1% - 1%)|
|BrowseMonthSSDVFacets|6.15|(1.1%)|6.14|(3.6%)|-0.2%(-4% - 4%)|
|HighPhrase|186.86|(6.2%)|186.64|(3.7%)|-0.1%(-9% - 10%)|
|Respell|48.69|(4.1%)|48.65|(4.0%)|-0.1%(-7% - 8%)|
|AndHighMed|65.66|(3.0%)|65.74|(3.2%)|0.1%(-5% - 6%)|
|HighIntervalsOrdered|6.68|(1.5%)|6.69|(1.7%)|0.1%(-3% - 3%)|
|LowPhrase|219.11|(5.7%)|220.24|(3.5%)|0.5%(-8% - 10%)|
|OrHighMed|68.05|(4.5%)|68.44|(3.1%)|0.6%(-6% - 8%)|
|OrNotHighMed|272.89|(5.7%)|274.77|(4.1%)|0.7%(-8% - 11%)|
|IntNRQ|37.58|(23.8%)|37.96|(24.2%)|1.0%(-37% - 64%)|
|BrowseDayOfYearSSDVFacets|5.34|(4.2%)|5.40|(2.9%)|1.2%(-5% - 8%)|
|HighTermMonthSort|34.82|(11.7%)|35.81|(14.9%)|2.9%(-21% - 33%)|
|BrowseMonthTaxoFacets|4781.41|(3.9%)|4931.19|(2.7%)|3.1%(-3% - 10%)|
|Fuzzy1|35.98|(9.7%)|37.42|(8.0%)|4.0%(-12% - 23%)|
|BrowseDayOfYearTaxoFacets|4688.64|(3.6%)|4878.52|(3.6%)|4.0%(-3% - 11%)|
|PKLookup|72.93|(4.7%)|95.23|(3.3%)|30.6%(21% - 40%)|



was (Author: jgq2008303393):
We ran another test case, _wikimedium10m_, to verify the improvement on a large 
data set. The complete results are 
[here|https://gist.github.com/jgq2008303393/44768d69a843c7b421e765bbab9360fd.js].
 The following table is the result of the last run:

|TaskQPS  |baseline|StdDevQPS|my_modified_version| 

[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938416#comment-16938416
 ] 

Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 10:50 AM:
--

We have done more performance testing using the _luceneutil_ tool, and the 
complete test results are 
[here|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js].

The _luceneutil_ tool repeatedly executed the _wikimedium10k_ task 20 times. The 
following table is the result of the last run. As shown in the table below, 
most of the indicators are basically stable, while the _PKLookup_ indicator 
shows a performance improvement of 58.9%. 
|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)|
|MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)|
|LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)|
|AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)|
|BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)|
|HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)|
|Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)|
|LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)|
|Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)|
|AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)|
|OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)|
|AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)|
|LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)|
|IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)|
|HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)|
|Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)|
|OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)|
|BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)|
|Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)|
|MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)|
|MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 22%)|
|HighSloppyPhrase|320.36|(10.0%)|323.46|(14.6%)|1.0%(-21% - 28%)|
|BrowseDateTaxoFacets|2065.89|(13.7%)|2088.22|(13.2%)|1.1%(-22% - 32%)|
|Respell|187.05|(12.2%)|189.48|(10.1%)|1.3%(-18% - 26%)|
|MedSloppyPhrase|583.45|(11.3%)|592.32|(9.9%)|1.5%(-17% - 25%)|
|HighTerm|1114.87|(12.0%)|1131.89|(12.8%)|1.5%(-20% - 29%)|
|HighTermDayOfYearSort|408.17|(13.1%)|416.13|(9.3%)|1.9%(-18% - 27%)|
|BrowseDayOfYearTaxoFacets|5460.05|(8.5%)|5591.96|(8.0%)|2.4%(-13% - 20%)|
|BrowseMonthTaxoFacets|5490.18|(8.0%)|5654.03|(9.3%)|3.0%(-13% - 22%)|
|LowSloppyPhrase|562.96|(10.1%)|583.91|(9.5%)|3.7%(-14% - 25%)|
|HighPhrase|221.20|(11.9%)|229.85|(12.2%)|3.9%(-17% - 31%)|
|OrHighMed|352.09|(12.3%)|369.39|(9.4%)|4.9%(-14% - 30%)|
|PKLookup|85.19|(18.1%)|135.38|(22.7%)|58.9%( 15% - 121%)|


was (Author: jgq2008303393):
We have done more performance testing using the _luceneutil_ tool, and the 
complete test results are 
[here|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js].

The _luceneutil_ tool repeatedly executed the _wikimedium10k_ task 20 times. The 
following table is the result of the last run. As shown in the table below, 
most of the indicators are basically stable, while the _PKLookup_ indicator 
shows a performance improvement of 58.9%. The _Get_ and _Bulk_ APIs of 
Elasticsearch will also benefit from this enhancement.
|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)|
|MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)|
|LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)|
|AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)|
|BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)|
|HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)|
|Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)|
|LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)|
|Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)|
|AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)|
|OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)|
|AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)|
|LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)|
|IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)|
|HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)|
|Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)|
|OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)|
|BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)|
|Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)|
|MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)|
|MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 22%)|
|HighSloppyPhrase|320.36|(10.0%)|323.46|(14.6%)|1.0%(-21% - 28%)|
|BrowseDateTaxoFacets|2065.89|(13.7%)|2088.22|(13.2%)|1.1%(-22% - 32%)|

[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938415#comment-16938415
 ] 

Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 10:49 AM:
--

*Tests*

We have done some write benchmarks with _id values in UUID V1 format, and the 
write performance of Elasticsearch is as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|

As shown above, after 8 hours of continuous writing, write speed improves by 
18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. 
The search API of Elasticsearch will also benefit from this improvement.

It should be noted that the benchmark needs to run continuously for several 
hours, because the performance improvement is not obvious when the data is 
completely cached or the number of segments is too small.


was (Author: jgq2008303393):
*Tests*

We have done some write benchmarks using _id values in UUID V1 format, and the 
benchmark results are as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|

As shown above, after 8 hours of continuous writing, write speed improves by 
18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. 
The Get and Bulk APIs of Elasticsearch will also benefit from this enhancement.

It should be noted that the benchmark needs to run continuously for several 
hours, because the performance improvement is not obvious when the data is 
completely cached or the number of segments is too small.

> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an indexed _id 
> field that uniquely identifies it. When Elasticsearch uses the _id field to 
> find a document in Lucene, Lucene has to check all the segments of the 
> index. When the values of the _id field are highly sequential, the 
> performance is optimizable.
>  
> *Solution*
> Since Lucene stores min/maxTerm metrics for each segment and field, we can 
> use those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in an index, we can 
> check whether the term falls in the range of minTerm and maxTerm, so that we 
> can skip some useless segments as soon as possible.
>  
> This PR is beneficial to the ES read/write APIs and the Lucene lookup API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Jiang updated LUCENE-8980:
---
Description: 
*Description*

In Elasticsearch, which is based on Lucene, each document has an indexed _id 
field that uniquely identifies it. When Elasticsearch uses the _id field to find 
a document in Lucene, Lucene has to check all the segments of the index. When 
the values of the _id field are highly sequential, the performance is 
optimizable.
 

*Solution*

Since Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in an index, we can check 
whether the term falls in the range of minTerm and maxTerm, so that we can skip 
some useless segments as soon as possible.
 
This PR is beneficial to the ES read/write APIs and the Lucene lookup API.



  was:
*Description*

In Elasticsearch, which is based on Lucene, each document has an indexed _id 
field that uniquely identifies it. When Elasticsearch uses the _id field to find 
a document in Lucene, Lucene has to check all the segments of the index. When 
the values of the _id field are highly sequential, the performance is 
optimizable.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in an index, we can check 
whether the term falls in the range of minTerm and maxTerm, so that we can skip 
some useless segments as soon as possible.
 
This PR is beneficial to the ES read/write APIs and the Lucene lookup API.




> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an indexed _id 
> field that uniquely identifies it. When Elasticsearch uses the _id field to 
> find a document in Lucene, Lucene has to check all the segments of the 
> index. When the values of the _id field are highly sequential, the 
> performance is optimizable.
>  
> *Solution*
> As Lucene stores min/maxTerm metrics for each segment and field, we can use 
> those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in an index, we can 
> check whether the term falls in the range of minTerm and maxTerm, so that we 
> can skip some useless segments as soon as possible.
>  
> This PR is beneficial to the ES read/write APIs and the Lucene lookup API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Jiang updated LUCENE-8980:
---
Description: 
*Description*

In Elasticsearch, which is based on Lucene, each document has an indexed _id 
field that uniquely identifies it. When Elasticsearch use the _id field to find 
a document from Lucene, Lucene have to check all the segments of the index. 
When the values of the _id field are very sequentially, the performance is 
optimizable.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in an index, we can check whether 
the term falls within the range [minTerm, maxTerm], so that we can skip 
non-matching segments as early as possible.
 
This PR benefits the ES read/write APIs and the Lucene lookup API.



  was:
*Description*
In Elasticsearch, which is based on Lucene, each document has an _id field that 
uniquely identifies it. The _id field is indexed so that each document can be 
looked up from Lucene. When users write documents with sequential _id values, 
Elasticsearch has to check _id uniqueness through the Lucene API for each 
document, which results in poor write performance.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in one segment, we can check 
whether the term falls within the range [minTerm, maxTerm], so that we skip 
non-matching segments as early as possible.
 




> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an indexed _id 
> field that uniquely identifies it. When Elasticsearch uses the _id field to 
> find a document from Lucene, Lucene has to check all the segments of the 
> index. When the values of the _id field are highly sequential, the 
> performance can be optimised.
>  
> *Solution*
> As Lucene stores min/maxTerm metrics for each segment and field, we can use 
> those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in an index, we can 
> check whether the term falls within the range [minTerm, maxTerm], so that we 
> can skip non-matching segments as early as possible.
>  
> This PR benefits the ES read/write APIs and the Lucene lookup API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Jiang updated LUCENE-8980:
---
Description: 
*Description*
In Elasticsearch, which is based on Lucene, each document has an _id field that 
uniquely identifies it. The _id field is indexed so that each document can be 
looked up from Lucene. When users write documents with sequential _id values, 
Elasticsearch has to check _id uniqueness through the Lucene API for each 
document, which results in poor write performance.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in one segment, we can check 
whether the term falls within the range [minTerm, maxTerm], so that we skip 
non-matching segments as early as possible.
 



  was:
*Description*
In Elasticsearch, which is based on Lucene, each document has an _id field that 
uniquely identifies it. The _id field is indexed so that each document can be 
looked up from Lucene. When users write data with sequential _id values, 
Elasticsearch has to check _id uniqueness through the Lucene API for each 
document, which results in poor write performance.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in one segment, we can check 
whether the term falls within the range [minTerm, maxTerm], so that we skip 
non-matching segments as early as possible.
 




> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an _id field 
> that uniquely identifies it. The _id field is indexed so that each document 
> can be looked up from Lucene. When users write documents with sequential _id 
> values, Elasticsearch has to check _id uniqueness through the Lucene API for 
> each document, which results in poor write performance.
>  
> *Solution*
> As Lucene stores min/maxTerm metrics for each segment and field, we can use 
> those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can 
> check whether the term falls within the range [minTerm, maxTerm], so that we 
> skip non-matching segments as early as possible.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Jiang updated LUCENE-8980:
---
Description: 
*Description*
In Elasticsearch, which is based on Lucene, each document has an _id field that 
uniquely identifies it. The _id field is indexed so that each document can be 
looked up from Lucene. When users write data with sequential _id values, 
Elasticsearch has to check _id uniqueness through the Lucene API for each 
document, which results in poor write performance.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in one segment, we can check 
whether the term falls within the range [minTerm, maxTerm], so that we skip 
non-matching segments as early as possible.
 



  was:
*Description*

In Elasticsearch, which is based on Lucene, each document has an _id field that 
uniquely identifies it, which is indexed so that documents can be looked up from 
Lucene. When users write data with self-generated _id values, even if the 
conflict rate is very low, Elasticsearch has to check _id uniqueness through the 
Lucene API for each document, which results in poor write performance.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in one segment, we can check 
whether the term falls within the range [minTerm, maxTerm], so that we skip 
non-matching segments as early as possible.
 




> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an _id field 
> that uniquely identifies it. The _id field is indexed so that each document 
> can be looked up from Lucene. When users write data with sequential _id 
> values, Elasticsearch has to check _id uniqueness through the Lucene API for 
> each document, which results in poor write performance.
>  
> *Solution*
> As Lucene stores min/maxTerm metrics for each segment and field, we can use 
> those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can 
> check whether the term falls within the range [minTerm, maxTerm], so that we 
> skip non-matching segments as early as possible.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Jiang updated LUCENE-8980:
---
Description: 
*Description*

In Elasticsearch, which is based on Lucene, each document has an _id field that 
uniquely identifies it, which is indexed so that documents can be looked up from 
Lucene. When users write data with self-generated _id values, even if the 
conflict rate is very low, Elasticsearch has to check _id uniqueness through the 
Lucene API for each document, which results in poor write performance.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in one segment, we can check 
whether the term falls within the range [minTerm, maxTerm], so that we skip 
non-matching segments as early as possible.
 



  was:
*Description*

In Elasticsearch, which is based on Lucene, each document has an _id field that 
uniquely identifies it, which is indexed so that documents can be looked up from 
Lucene. When users write to Elasticsearch with self-generated _id values, even 
if the conflict rate is very low, Elasticsearch has to check _id uniqueness 
through the Lucene API for each document, which results in poor write 
performance.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in one segment, we can check 
whether the term falls within the range [minTerm, maxTerm], so that we skip 
non-matching segments as early as possible.
 




> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an _id field 
> that uniquely identifies it, which is indexed so that documents can be looked 
> up from Lucene. When users write data with self-generated _id values, even if 
> the conflict rate is very low, Elasticsearch has to check _id uniqueness 
> through the Lucene API for each document, which results in poor write 
> performance.
>  
> *Solution*
> As Lucene stores min/maxTerm metrics for each segment and field, we can use 
> those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can 
> check whether the term falls within the range [minTerm, maxTerm], so that we 
> skip non-matching segments as early as possible.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8928) BKDWriter could make splitting decisions based on the actual range of values

2019-09-26 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938429#comment-16938429
 ] 

Ignacio Vera commented on LUCENE-8928:
--

Ran some benchmarks comparing this new approach with the previous one; they 
show similar query performance but a much faster indexing rate:

||Approach||Index time Dev (s)||Index time Base (s)||Diff||Force merge Dev (s)||Force merge Base (s)||Diff||Index size Dev (GB)||Index size Base (GB)||Diff||Reader heap Dev (MB)||Reader heap Base (MB)||Diff||
|geo3d|163.5s|218.4s|-25%|0.0s|0.0s| 0%|0.71|0.71|-0%|1.75|1.75|-0%|
|shapes|227.8s|319.6s|-29%|0.0s|0.0s| 0%|1.27|1.27| 0%|1.78|1.78| 0%|






||Approach||Shape||M hits/sec Dev||M hits/sec Base||Diff||QPS Dev||QPS Base||Diff||Hit count Dev||Hit count Base||Diff||
|geo3d|box|55.58|57.53|-3%|56.56|58.54|-3%|221118844|221118844| 0%|
|geo3d|polyRussia|0.56|0.56|-1%|0.16|0.16|-1%|3508671|3508671| 0%|
|geo3d|poly 10|48.87|51.25|-5%|30.90|32.41|-5%|355855227|355855227| 0%|
|geo3d|polyMedium|0.62|0.63|-1%|7.64|7.67|-1%|2693545|2693545| 0%|
|geo3d|distance|68.16|69.70|-2%|40.00|40.91|-2%|383371884|383371884| 0%|
|shapes|box|45.99|46.52|-1%|46.80|47.34|-1%|221118844|221118844| 0%|
|shapes|polyRussia|6.64|7.01|-5%|1.89|2.00|-5%|3508846|3508846| 0%|
|shapes|poly 10|33.40|34.69|-4%|21.12|21.93|-4%|355809475|355809475| 0%|
|shapes|polyMedium|3.07|3.30|-7%|37.62|40.43|-7%|2693559|2693559| 0%|

> BKDWriter could make splitting decisions based on the actual range of values
> 
>
> Key: LUCENE-8928
> URL: https://issues.apache.org/jira/browse/LUCENE-8928
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Currently BKDWriter assumes that splitting on one dimension has no effect on 
> values in other dimensions. While this may be ok for geo points, this is 
> usually not true for ranges (or geo shapes, which are ranges too). Maybe we 
> could get better indexing by re-computing the range of values on each 
> dimension before making the choice of the split dimension?
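
A minimal sketch of this idea (a hypothetical helper, not BKDWriter's actual 
implementation) could look like the following, where the split dimension is 
chosen from the values actually present in the current node:

{code:java}
import java.util.Arrays;

// Sketch only: re-compute per-dimension min/max over the node's points and
// split on the dimension with the widest actual spread.
static int pickSplitDim(byte[][] points, int numDims, int bytesPerDim) {
  int bestDim = 0;
  long bestSpread = -1;
  for (int dim = 0; dim < numDims; dim++) {
    int off = dim * bytesPerDim;
    byte[] min = points[0];
    byte[] max = points[0];
    for (byte[] p : points) { // range of this node, not the global index range
      if (Arrays.compareUnsigned(p, off, off + bytesPerDim, min, off, off + bytesPerDim) < 0) {
        min = p;
      }
      if (Arrays.compareUnsigned(p, off, off + bytesPerDim, max, off, off + bytesPerDim) > 0) {
        max = p;
      }
    }
    // crude spread estimate: position and magnitude of the first differing byte
    long spread = 0;
    for (int i = 0; i < bytesPerDim; i++) {
      int d = (max[off + i] & 0xFF) - (min[off + i] & 0xFF);
      if (d != 0) {
        spread = ((long) (bytesPerDim - i) << 8) | d;
        break;
      }
    }
    if (spread > bestSpread) {
      bestSpread = spread;
      bestDim = dim;
    }
  }
  return bestDim;
}
{code}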



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938415#comment-16938415
 ] 

Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 9:12 AM:
-

*Tests*

We ran some write benchmarks using _id values in UUID v1 format; the results 
are as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|

As shown above, after 8 hours of continuous writing, write speed improves by 
18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. 
The Elasticsearch Get and Bulk APIs will also benefit from this enhancement.

Note that the benchmark needs to run continuously for several hours, because 
the improvement is not obvious when the data is completely cached or the number 
of segments is too small.


was (Author: jgq2008303393):
*Tests*

I ran some write benchmarks using _id values in UUID v1 format; the results are 
as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|

As shown above, after 8 hours of continuous writing, write speed improves by 
18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. 
The Elasticsearch Get and Bulk APIs will also benefit from this enhancement.

Note that the benchmark needs to run continuously for several hours, because 
the improvement is not obvious when the data is completely cached or the number 
of segments is too small.

> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an _id field 
> that uniquely identifies it, which is indexed so that documents can be looked 
> up from Lucene. When users write to Elasticsearch with self-generated _id 
> values, even if the conflict rate is very low, Elasticsearch has to check _id 
> uniqueness through the Lucene API for each document, which results in poor 
> write performance.
>  
> *Solution*
> As Lucene stores min/maxTerm metrics for each segment and field, we can use 
> those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can 
> check whether the term falls within the range [minTerm, maxTerm], so that we 
> skip non-matching segments as early as possible.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938416#comment-16938416
 ] 

Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 9:11 AM:
-

We have done more performance testing using the _luceneutil_ tool. The complete 
test results are 
[here|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js].

The _luceneutil_ tool repeatedly executes the _wikimedium10k_ task 20 times. The 
following table is the result of the last run. As shown in the table below, 
most of the indicators are basically stable, while the _PKLookup_ indicator 
shows a performance improvement of 58.9%. The Elasticsearch _Get_ and _Bulk_ 
APIs will also benefit from this enhancement.
|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)|
|MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)|
|LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)|
|AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)|
|BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)|
|HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)|
|Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)|
|LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)|
|Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)|
|AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)|
|OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)|
|AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)|
|LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)|
|IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)|
|HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)|
|Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)|
|OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)|
|BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)|
|Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)|
|MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)|
|MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 22%)|
|HighSloppyPhrase|320.36|(10.0%)|323.46|(14.6%)|1.0%(-21% - 28%)|
|BrowseDateTaxoFacets|2065.89|(13.7%)|2088.22|(13.2%)|1.1%(-22% - 32%)|
|Respell|187.05|(12.2%)|189.48|(10.1%)|1.3%(-18% - 26%)|
|MedSloppyPhrase|583.45|(11.3%)|592.32|(9.9%)|1.5%(-17% - 25%)|
|HighTerm|1114.87|(12.0%)|1131.89|(12.8%)|1.5%(-20% - 29%)|
|HighTermDayOfYearSort|408.17|(13.1%)|416.13|(9.3%)|1.9%(-18% - 27%)|
|BrowseDayOfYearTaxoFacets|5460.05|(8.5%)|5591.96|(8.0%)|2.4%(-13% - 20%)|
|BrowseMonthTaxoFacets|5490.18|(8.0%)|5654.03|(9.3%)|3.0%(-13% - 22%)|
|LowSloppyPhrase|562.96|(10.1%)|583.91|(9.5%)|3.7%(-14% - 25%)|
|HighPhrase|221.20|(11.9%)|229.85|(12.2%)|3.9%(-17% - 31%)|
|OrHighMed|352.09|(12.3%)|369.39|(9.4%)|4.9%(-14% - 30%)|
|PKLookup|85.19|(18.1%)|135.38|(22.7%)|58.9%( 15% - 121%)|


was (Author: jgq2008303393):
We have done more performance testing using the _luceneutil_ tool. The complete 
test results are 
[here|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js].

The _luceneutil_ tool repeatedly executes the _wikimedium10k_ task 20 times. The 
following table is the result of the last run. As shown in the table below, 
most of the indicators are basically stable, while the _PKLookup_ indicator 
shows a performance improvement of 58.9%. The Elasticsearch _Get_ and _Bulk_ 
APIs will also benefit from this enhancement.
|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)|
|MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)|
|LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)|
|AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)|
|BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)|
|HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)|
|Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)|
|LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)|
|Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)|
|AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)|
|OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)|
|AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)|
|LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)|
|IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)|
|HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)|
|Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)|
|OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)|
|BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)|
|Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)|
|MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)|
|MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 

[jira] [Comment Edited] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938416#comment-16938416
 ] 

Guoqiang Jiang edited comment on LUCENE-8980 at 9/26/19 9:11 AM:
-

We have done more performance testing using the _luceneutil_ tool. The complete 
test results are 
[here|https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js].

The _luceneutil_ tool repeatedly executes the _wikimedium10k_ task 20 times. The 
following table is the result of the last run. As shown in the table below, 
most of the indicators are basically stable, while the _PKLookup_ indicator 
shows a performance improvement of 58.9%. The Elasticsearch _Get_ and _Bulk_ 
APIs will also benefit from this enhancement.
|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)|
|MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)|
|LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)|
|AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)|
|BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)|
|HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)|
|Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)|
|LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)|
|Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)|
|AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)|
|OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)|
|AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)|
|LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)|
|IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)|
|HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)|
|Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)|
|OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)|
|BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)|
|Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)|
|MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)|
|MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 22%)|
|HighSloppyPhrase|320.36|(10.0%)|323.46|(14.6%)|1.0%(-21% - 28%)|
|BrowseDateTaxoFacets|2065.89|(13.7%)|2088.22|(13.2%)|1.1%(-22% - 32%)|
|Respell|187.05|(12.2%)|189.48|(10.1%)|1.3%(-18% - 26%)|
|MedSloppyPhrase|583.45|(11.3%)|592.32|(9.9%)|1.5%(-17% - 25%)|
|HighTerm|1114.87|(12.0%)|1131.89|(12.8%)|1.5%(-20% - 29%)|
|HighTermDayOfYearSort|408.17|(13.1%)|416.13|(9.3%)|1.9%(-18% - 27%)|
|BrowseDayOfYearTaxoFacets|5460.05|(8.5%)|5591.96|(8.0%)|2.4%(-13% - 20%)|
|BrowseMonthTaxoFacets|5490.18|(8.0%)|5654.03|(9.3%)|3.0%(-13% - 22%)|
|LowSloppyPhrase|562.96|(10.1%)|583.91|(9.5%)|3.7%(-14% - 25%)|
|HighPhrase|221.20|(11.9%)|229.85|(12.2%)|3.9%(-17% - 31%)|
|OrHighMed|352.09|(12.3%)|369.39|(9.4%)|4.9%(-14% - 30%)|
|PKLookup|85.19|(18.1%)|135.38|(22.7%)|58.9%( 15% - 121%)|


was (Author: jgq2008303393):
We have done more performance testing using the _luceneutil_ tool. The complete 
test results are 
[here](https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js).

The _luceneutil_ tool repeatedly executes the _wikimedium10k_ task 20 times. The 
following table is the result of the last run. As shown in the table below, 
most of the indicators are basically stable, while the _PKLookup_ indicator 
shows a performance improvement of 58.9%. The Elasticsearch _Get_ and _Bulk_ 
APIs will also benefit from this enhancement.


|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)|
|MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)|
|LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)|
|AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)|
|BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)|
|HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)|
|Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)|
|LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)|
|Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)|
|AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)|
|OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)|
|AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)|
|LowPhrase|

[jira] [Commented] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938423#comment-16938423
 ] 

Guoqiang Jiang commented on LUCENE-8980:


We ran another test case, _wikimedium10m_, to verify the improvement on a large 
data set. The complete results are 
[here|https://gist.github.com/jgq2008303393/44768d69a843c7b421e765bbab9360fd.js]. 
The following table is the result of the last run:

|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|OrHighNotLow|293.93|(5.8%)|286.46|(6.6%)|-2.5%(-14% - 10%)|
|OrHighNotHigh|258.18|(3.7%)|252.41|(5.0%)|-2.2%(-10% - 6%)|
|OrHighLow|206.52|(6.2%)|202.55|(6.2%)|-1.9%(-13% - 11%)|
|MedPhrase|16.41|(4.1%)|16.12|(2.6%)|-1.7%(-8% - 5%)|
|LowTerm|608.71|(5.7%)|599.21|(4.4%)|-1.6%(-10% - 9%)|
|Prefix3|37.96|(2.8%)|37.51|(3.8%)|-1.2%(-7% - 5%)|
|OrNotHighHigh|255.49|(5.5%)|252.63|(6.1%)|-1.1%(-12% - 11%)|
|MedSloppyPhrase|13.71|(3.5%)|13.58|(3.7%)|-1.0%(-7% - 6%)|
|HighSloppyPhrase|17.00|(3.3%)|16.84|(3.7%)|-0.9%(-7% - 6%)|
|OrHighHigh|19.02|(2.6%)|18.85|(2.7%)|-0.9%(-6% - 4%)|
|MedTerm|564.56|(4.6%)|559.38|(2.9%)|-0.9%(-8% - 6%)|
|OrNotHighLow|294.29|(4.9%)|291.86|(4.2%)|-0.8%(-9% - 8%)|
|AndHighLow|303.17|(3.7%)|300.72|(4.5%)|-0.8%(-8% - 7%)|
|AndHighHigh|28.24|(2.1%)|28.01|(2.7%)|-0.8%(-5% - 4%)|
|Wildcard|64.64|(3.9%)|64.21|(4.0%)|-0.7%(-8% - 7%)|
|HighSpanNear|15.14|(2.8%)|15.04|(2.5%)|-0.7%(-5% - 4%)|
|HighTerm|431.22|(3.9%)|428.68|(2.9%)|-0.6%(-7% - 6%)|
|LowSloppyPhrase|19.29|(2.2%)|19.18|(2.9%)|-0.6%(-5% - 4%)|
|LowSpanNear|64.32|(2.3%)|63.99|(2.0%)|-0.5%(-4% - 3%)|
|Fuzzy2|34.51|(12.8%)|34.34|(11.9%)|-0.5%(-22% - 27%)|
|MedSpanNear|51.51|(2.3%)|51.28|(1.6%)|-0.4%(-4% - 3%)|
|HighTermDayOfYearSort|51.45|(6.6%)|51.24|(7.5%)|-0.4%(-13% - 14%)|
|OrHighNotMed|306.95|(5.1%)|306.03|(3.2%)|-0.3%(-8% - 8%)|
|BrowseDateTaxoFacets|1.48|(0.6%)|1.47|(1.2%)|-0.2%(-1% - 1%)|
|BrowseMonthSSDVFacets|6.15|(1.1%)|6.14|(3.6%)|-0.2%(-4% - 4%)|
|HighPhrase|186.86|(6.2%)|186.64|(3.7%)|-0.1%(-9% - 10%)|
|Respell|48.69|(4.1%)|48.65|(4.0%)|-0.1%(-7% - 8%)|
|AndHighMed|65.66|(3.0%)|65.74|(3.2%)|0.1%(-5% - 6%)|
|HighIntervalsOrdered|6.68|(1.5%)|6.69|(1.7%)|0.1%(-3% - 3%)|
|LowPhrase|219.11|(5.7%)|220.24|(3.5%)|0.5%(-8% - 10%)|
|OrHighMed|68.05|(4.5%)|68.44|(3.1%)|0.6%(-6% - 8%)|
|OrNotHighMed|272.89|(5.7%)|274.77|(4.1%)|0.7%(-8% - 11%)|
|IntNRQ|37.58|(23.8%)|37.96|(24.2%)|1.0%(-37% - 64%)|
|BrowseDayOfYearSSDVFacets|5.34|(4.2%)|5.40|(2.9%)|1.2%(-5% - 8%)|
|HighTermMonthSort|34.82|(11.7%)|35.81|(14.9%)|2.9%(-21% - 33%)|
|BrowseMonthTaxoFacets|4781.41|(3.9%)|4931.19|(2.7%)|3.1%(-3% - 10%)|
|Fuzzy1|35.98|(9.7%)|37.42|(8.0%)|4.0%(-12% - 23%)|
|BrowseDayOfYearTaxoFacets|4688.64|(3.6%)|4878.52|(3.6%)|4.0%(-3% - 11%)|
|PKLookup|72.93|(4.7%)|95.23|(3.3%)|30.6%(21% - 40%)|


> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>

[jira] [Commented] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938416#comment-16938416
 ] 

Guoqiang Jiang commented on LUCENE-8980:


We have done more performance testing using the _luceneutil_ tool. The complete 
test results are 
[here](https://gist.github.com/jgq2008303393/42d536f44b4845c01329a402202273eb.js).

The _luceneutil_ tool repeatedly executes the _wikimedium10k_ task 20 times. The 
following table is the result of the last run. As shown in the table below, 
most of the indicators are basically stable, while the _PKLookup_ indicator 
shows a performance improvement of 58.9%. The Elasticsearch _Get_ and _Bulk_ 
APIs will also benefit from this enhancement.


|TaskQPS|baseline|StdDevQPS|my_modified_version|StdDev|Pct_diff(percent_diff)|
|HighIntervalsOrdered|303.36|(12.5%)|283.86|(16.9%)|-6.4%(-31% - 26%)|
|MedPhrase|404.26|(12.3%)|382.64|(10.5%)|-5.3%(-25% - 19%)|
|LowTerm|2302.28|(8.7%)|2180.74|(11.8%)|-5.3%(-23% - 16%)|
|AndHighMed|618.78|(10.1%)|586.61|(11.8%)|-5.2%(-24% - 18%)|
|BrowseDayOfYearSSDVFacets|1042.68|(10.1%)|992.82|(10.7%)|-4.8%(-23% - 17%)|
|HighSpanNear|263.62|(12.9%)|256.07|(14.9%)|-2.9%(-27% - 28%)|
|Wildcard|221.10|(16.2%)|215.32|(11.9%)|-2.6%(-26% - 30%)|
|LowSpanNear|656.60|(7.9%)|639.77|(11.3%)|-2.6%(-20% - 18%)|
|Fuzzy1|135.61|(9.1%)|132.26|(10.4%)|-2.5%(-20% - 18%)|
|AndHighHigh|409.88|(10.9%)|399.79|(12.6%)|-2.5%(-23% - 23%)|
|OrHighHigh|318.45|(12.9%)|312.43|(12.2%)|-1.9%(-23% - 26%)|
|AndHighLow|937.17|(10.2%)|921.71|(11.4%)|-1.6%(-21% - 22%)|
|LowPhrase|385.06|(12.3%)|379.83|(10.8%)|-1.4%(-21% - 24%)|
|IntNRQ|618.69|(14.1%)|610.58|(10.6%)|-1.3%(-22% - 27%)|
|HighTermMonthSort|1178.14|(9.5%)|1164.48|(12.6%)|-1.2%(-21% - 23%)|
|Fuzzy2|46.95|(16.2%)|46.57|(15.6%)|-0.8%(-28% - 36%)|
|OrHighLow|633.64|(9.6%)|629.21|(9.9%)|-0.7%(-18% - 20%)|
|BrowseMonthSSDVFacets|1157.34|(12.1%)|1155.63|(13.5%)|-0.1%(-23% - 29%)|
|Prefix3|297.40|(12.1%)|298.16|(12.7%)|0.3%(-21% - 28%)|
|MedSpanNear|434.56|(10.0%)|437.02|(11.4%)|0.6%(-19% - 24%)|
|MedTerm|2158.68|(8.8%)|2177.67|(11.1%)|0.9%(-17% - 22%)|
|HighSloppyPhrase|320.36|(10.0%)|323.46|(14.6%)|1.0%(-21% - 28%)|
|BrowseDateTaxoFacets|2065.89|(13.7%)|2088.22|(13.2%)|1.1%(-22% - 32%)|
|Respell|187.05|(12.2%)|189.48|(10.1%)|1.3%(-18% - 26%)|
|MedSloppyPhrase|583.45|(11.3%)|592.32|(9.9%)|1.5%(-17% - 25%)|
|HighTerm|1114.87|(12.0%)|1131.89|(12.8%)|1.5%(-20% - 29%)|
|HighTermDayOfYearSort|408.17|(13.1%)|416.13|(9.3%)|1.9%(-18% - 27%)|
|BrowseDayOfYearTaxoFacets|5460.05|(8.5%)|5591.96|(8.0%)|2.4%(-13% - 20%)|
|BrowseMonthTaxoFacets|5490.18|(8.0%)|5654.03|(9.3%)|3.0%(-13% - 22%)|
|LowSloppyPhrase|562.96|(10.1%)|583.91|(9.5%)|3.7%(-14% - 25%)|
|HighPhrase|221.20|(11.9%)|229.85|(12.2%)|3.9%(-17% - 31%)|
|OrHighMed|352.09|(12.3%)|369.39|(9.4%)|4.9%(-14% - 30%)|
|PKLookup|85.19|(18.1%)|135.38|(22.7%)|58.9%(15% - 121%)|

> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an _id field 
> that uniquely identifies it, which 

[jira] [Updated] (LUCENE-8980) Optimise SegmentTermsEnum.seekExact performance

2019-09-26 Thread Guoqiang Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Jiang updated LUCENE-8980:
---
Description: 
*Description*

In Elasticsearch, which is based on Lucene, each document has an _id field that 
uniquely identifies it, which is indexed so that documents can be looked up from 
Lucene. When users write to Elasticsearch with self-generated _id values, even 
if the conflict rate is very low, Elasticsearch has to check _id uniqueness 
through the Lucene API for each document, which results in poor write 
performance.
 

*Solution*

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in one segment, we can check 
whether the term falls within the range [minTerm, maxTerm], so that we skip 
non-matching segments as early as possible.
 



  was:
*Description*

In Elasticsearch, each document has an _id field that uniquely identifies it, 
which is indexed so that documents can be looked up from Lucene. When users 
write to Elasticsearch with self-generated _id values, even if the conflict 
rate is very low, Elasticsearch has to check _id uniqueness through the Lucene 
API for each document, which results in poor write performance.
 

*Solution*

1. Choose a better _id generator before writing ES

Different _id formats have a great impact on write performance. We have 
verified this in a production cluster. Users can refer to the following blog 
post to choose a better _id generator.

[http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html]

2. Optimise with min/maxTerm metrics in Lucene

As Lucene stores min/maxTerm metrics for each segment and field, we can use 
those metrics to optimise the performance of the Lucene lookup API. When calling 
SegmentTermsEnum.seekExact() to look up a term in one segment, we can check 
whether the term falls within the range [minTerm, maxTerm], so that we skip 
non-matching segments as early as possible.
 

*Tests*

I ran some write benchmarks using _id values in UUID v1 format; the results are 
as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|

As shown above, after 8 hours of continuous writing, write speed improves by 
18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. 
The Elasticsearch GET API and ids query would see similar performance 
improvements.

Note that the benchmark needs to run continuously for several hours, because 
the improvement is not obvious when the data is completely cached or the number 
of segments is too small.


> Optimise SegmentTermsEnum.seekExact performance
> ---
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.2
>Reporter: Guoqiang Jiang
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: performance
> Fix For: master (9.0)
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an _id field 
> that uniquely identifies it, which is indexed so that documents can be looked 
> up from Lucene. When users write to Elasticsearch with self-generated _id 
> values, even if the conflict rate is very low, Elasticsearch has to check _id 
> uniqueness through the Lucene API for each document, which results in poor 
> write performance.
>  
> *Solution*
> As Lucene stores min/maxTerm metrics for each segment and field, we can use 
> those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can 
> check whether the term falls within the range [minTerm, maxTerm], so that we 
> skip non-matching segments as early as possible.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-26 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938389#comment-16938389
 ] 

Bruno Roustant commented on LUCENE-8920:


Here is a proposal for the heuristic to select the encoding of an FST node.

The idea is to have threshold values that vary according to a parameter, let's 
call it "timeSpaceBalance". timeSpaceBalance can have 4 values: MORE_COMPACT, 
COMPACT (default, current FST), FAST, FASTER.
Keep the current FST encoding/behavior for COMPACT.
Only try open-addressing encoding for the FAST or FASTER balance.
Be very demanding for direct-addressing for COMPACT or MORE_COMPACT.
Do we need the MORE_COMPACT mode? Even if I don't see the use-case now, since 
it's easy and does not involve more code to have it, I would say yes.

4 rules, one per possible encoding, ordered from top to bottom. The first 
encoding whose condition matches is selected.

n: number of labels (num sub-nodes)
depth: depth of the node in the tree

[list-encoding] if n <= L1 || (depth >= L2 && n <= L3)

[direct-addressing] if n / (max label - min label) >= D1

[try open-addressing] if depth <= O1 || n >= O2

[binary search] otherwise


And below are the threshold values for each timeSpaceBalance:

timeSpaceBalance = MORE_COMPACT (memory < x1)
L1 = 6, L2 = 4, L3 = 11
D1 = 0.8
O1 = -1, O2 = infinite

timeSpaceBalance = COMPACT (memory x1)
L1 = 4, L2 = 4, L3 = 9
D1 = 0.66
O1 = -1, O2 = infinite

timeSpaceBalance = FAST (memory x2 ?)
L1 = 4, L2 = 4, L3 = 7
D1 = 0.5
O1 = 3, O2 = 10

timeSpaceBalance = FASTER (memory x3 ?)
L1 = 3, L2 = 4, L3 = 5
D1 = 0.33
O1 = infinite, O2 = 0
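
As a sketch, the selection could be implemented like this (the threshold names 
L1..L3, D1, O1, O2 and the timeSpaceBalance modes are from this proposal, not 
existing Lucene API; pass Integer.MAX_VALUE for "infinite"):

{code:java}
enum NodeEncoding { LIST, DIRECT_ADDRESSING, OPEN_ADDRESSING, BINARY_SEARCH }

// Sketch only: rules are evaluated top to bottom, first match wins.
static NodeEncoding select(int n, int depth, int minLabel, int maxLabel,
                           int l1, int l2, int l3, double d1, int o1, int o2) {
  if (n <= l1 || (depth >= l2 && n <= l3)) {
    return NodeEncoding.LIST;
  }
  // label density over the spanned range, as in the rule n / (maxLabel - minLabel);
  // n == 1 is always caught by the LIST rule above, so no division by zero here
  if ((double) n / (maxLabel - minLabel) >= d1) {
    return NodeEncoding.DIRECT_ADDRESSING;
  }
  if (depth <= o1 || n >= o2) {
    return NodeEncoding.OPEN_ADDRESSING; // "try": a real impl may still fall back
  }
  return NodeEncoding.BINARY_SEARCH;
}
{code}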


Thoughts?

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
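
To illustrate the dense-id indirection mentioned above, here is a minimal 
sketch (a hypothetical structure, not Lucene's actual FST format): a sparse 
label -> dense-id lookup in front of a gap-free id -> arc-offset array, so a 
gap costs one small id entry rather than 10-20 bytes of arc metadata.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch only.
final class DenseArcIndex {
  private final Map<Integer, Integer> labelToId = new HashMap<>(); // sparse lookup
  private final long[] idToArcOffset;                              // dense, no gaps

  DenseArcIndex(int[] labels, long[] arcOffsets) {
    idToArcOffset = new long[labels.length];
    for (int id = 0; id < labels.length; id++) {
      labelToId.put(labels[id], id);      // label -> dense id
      idToArcOffset[id] = arcOffsets[id]; // dense id -> arc offset
    }
  }

  /** Returns the arc offset for the label, or -1 if the node has no such arc. */
  long arcOffset(int label) {
    Integer id = labelToId.get(label);
    return id == null ? -1 : idToArcOffset[id];
  }
}
{code}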



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #831: LUCENE-8949: Allow LeafFieldComparators to publish Feature Values

2019-09-26 Thread GitBox
atris commented on issue #831: LUCENE-8949: Allow LeafFieldComparators to 
publish Feature Values
URL: https://github.com/apache/lucene-solr/pull/831#issuecomment-535370985
 
 
   Hi @jpountz ,
   
   RE: this PR, I think it is a prerequisite for improvements like 
https://issues.apache.org/jira/browse/LUCENE-8988 and shared PQ based early 
termination.
   
   I was wondering if we could merge this PR and mark the API as experimental, 
along with a warning that it could be costly for some specific iterators. WDYT, 
please?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #899: LUCENE-8989: Allow IndexSearcher To Handle Rejected Execution

2019-09-26 Thread GitBox
atris commented on issue #899: LUCENE-8989: Allow IndexSearcher To Handle 
Rejected Execution
URL: https://github.com/apache/lucene-solr/pull/899#issuecomment-535360687
 
 
   Any thoughts on this one? Seems safe enough to merge?
   
   I plan to merge it in another 12 hours from now -- unless any objections
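   
   For reference, the general pattern the PR title describes can be sketched 
like this (a sketch of the common fallback idiom, not necessarily the PR's 
exact code):
   
{code:java}
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;

// Sketch only: try the executor first, degrade gracefully to the caller thread.
static void executeOrRunOnCaller(Executor executor, Runnable task) {
  try {
    executor.execute(task);
  } catch (RejectedExecutionException e) {
    task.run(); // the executor rejected the task: the caller thread does the work
  }
}
{code}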


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org